Preface


This volume contains the 66 technical papers presented at the Fifteenth International Conference on Machine
Learning (ICML '98), held July 24-27, 1998, in Madison, Wisconsin, U.S.A. These articles were selected based
on the rigorous review and discussion of 215 submissions. All papers were presented orally as well as at an
evening poster session.

ICML '98 was one of ten AI-related conferences held in Madison during mid-summer 1998, in an ambitious,
first-time experiment to see what kind of synergies would result from all this collocation. In particular, ICML
'98 was held in the same building as, and concurrently with, the Computational Learning Theory (COLT)
and Uncertainty in Artificial Intelligence (UAI) conferences. It also overlapped one day with the Inductive
Logic Programming (ILP) conference.

Registrants were allowed to attend, without additional costs, the technical sessions of the other conferences.
COLT, ICML, and UAI each invited one of the plenary speakers and jointly invited the banquet speaker, plus
there was a joint poster session and wrapup panel. The poster session contained about 150 papers, allowing
conference attendees a chance to see or further discuss research presented during the four to five parallel
tracks and fostering interaction among the various communities.

I  especially  wish  to  thank: 

• the authors of all the papers for their technical contributions toward the advancement of machine learning.

• Richard Sutton, who was the ICML representative among the three joint invited speakers and spoke on
"Reinforcement Learning: How Far Can It Go?"; Ron Kohavi, who was the second ICML invited speaker,
discussing "Crossing the Chasm: From Academic Machine Learning to Commercial Data Mining"; and
David Spiegelhalter, who was the jointly invited banquet speaker, describing "2.5 Millennia of Directed
Graphs."

• the advisory committee for their suggestions regarding the program committee and the invited speakers.

• the program committee for their efforts initially reviewing about a dozen submissions each and then
participating in in-depth discussions regarding which should be accepted.

• the organizers of the COLT, UAI, and ILP conferences for all their efforts spent coordinating our collocated
meetings, and all the invited speakers and technical-paper presenters that these communities provided.

• Carol Hamilton of the American Association for Artificial Intelligence (AAAI) for her invaluable help
coordinating AAAI '98 with ICML '98, and to AAAI in general for publicizing ICML '98, for processing ICML's
advance registrations, for the organization of several ML-related workshops and tutorials, and for the use
of the Madison convention center for some of the ICML '98 events.

• those sponsors (listed below) who provided financial support, allowing for reduced registration fees and
providing partial travel support to some needy graduate students and the invited speakers.

•  the  organizers  of  the  joint  AAAI/ICML  workshops  (listed  below)  and  to  the  presenters  of  ML-related 
AAAI  tutorials. 

• Jacque Girard and Patricia Danek of the University of Wisconsin Business School, where the bulk of the
conference's events were held, and to Maureen Sundell of the Wisconsin Office of Conference Services for
their excellent help with local arrangements.

•  Bradley  Schwarzhoff  for  ably  serving  as  a  conference  assistant,  Laura  Cuccia  for  secretarial  help,  and 
Sheila  Beattie,  Virginia  Werner,  Marie  Johnson,  Margaret  Roth,  and  Benjamin  Griffiths  for  processing  much 
paperwork  related  to  financial  and  legal  matters. 

•  all  the  student  volunteers  who  served  before  and  during  the  conference,  especially  Tina  Eliassi-Rad,  Daniel 
Shiovitz,  Chongmeng  Chow,  and  Carolyn  Allex. 

•  Morgan  Kaufmann  Publishers  for  distributing  the  volume;  to  Professional  Book  Center  for  producing  it, 
and  offering  ICML  contributors  an  experimental  option  to  deliver  their  finished  papers  as  PostScript  files 
via  the  Internet  (more  than  half  of  them  did  so);  and  to  Steve  Reiter  of  the  United  States  Geological  Service 
for  his  extraordinary  help  in  locating  and  obtaining  the  cover  image. 

Abstract

We propose new query learning strategies by combining the idea of query by committee
and that of boosting [Sch90, FS95] and bagging [Bre94]. Query by committee is a query
learning strategy which makes use of a randomized component learning algorithm and
works by querying the function value of a point at which the predictions made by many
copies of the component algorithm are maximally spread. The requirement of query by
committee that the component algorithm be an ideal randomized algorithm makes it hard
to apply in practice when we have only a moderately performing deterministic algorithm.
To address this issue, we borrow the ideas of boosting and bagging, which are both
techniques to enhance the performance of an existing learning algorithm by running it
many times on a set of re-sampled data and combining the output hypotheses to make a
prediction by (weighted) majority voting. We propose two query learning methods, query
by bagging and query by boosting, which select the next query point by picking a point
on which the (weighted) majority voting by the obtained hypotheses has the least margin.
We empirically evaluate the performance of these methods on a wide range of real world
data. Our experiments show that, when using C4.5 as the component learning algorithm
and run on data sets in the UCI Machine Learning repository, both query learning methods
significantly improve data efficiency as compared to both C4.5 itself and boosting
applied on C4.5. A typical increase in data efficiency achieved was 2 to 4-fold.

1  Introduction

Query learning is a sub-area of machine learning attracting increasing attention both in
theory and in practice, with the expectation that it may bring down both the
computational and sample complexities that plague passive learners (cf. [LC94, CS94,
LG94]). For example, there is a rich body of work on the algorithmic approach to query
learning as initiated by Angluin’s query learning model [Ang87]. Another promising
approach is the Bayesian or information theoretic approach to query learning [PK95,
SOS92], in which a query learner tries to maximize the information gain on each query.
Of the latter approach, ‘query by committee’ [SOS92] is an especially attractive and
general query learning strategy with a theoretical performance guarantee. In the present
paper, we propose new variants of query by committee, which we call ‘query by boosting’
and ‘query by bagging,’ by combining query by committee with the techniques of boosting
and bagging.

‘Query by committee’ [SOS92] is a query learning strategy which makes use of many copies
of an ideal randomized learning algorithm. More concretely, it uses a number of copies
of the Gibbs algorithm (a randomized algorithm that picks a hypothesis from a given
hypothesis class according to the posterior distribution and predicts according to it)
and queries the function value of a point at which their predictions are maximally
spread. The idea is that, by choosing a query point with maximum uncertainty of
estimation of its function value, the information gain can be maximized. Indeed, there
is a theoretical guarantee of the near-optimality of the data efficiency of this method,
but it is based on the assumption that the component learning algorithm is the Gibbs
algorithm. This assumption poses two problems when one tries to apply this technique in
practice. One is the problem of computational complexity, because Gibbs algorithms for
interesting hypothesis classes tend to be computationally intractable. The other is that
it cannot be applied to a deterministic component learning algorithm. The two methods we
propose in the present paper, ‘query by boosting’ and ‘query by bagging,’ are motivated
by these two issues.

‘Boosting’ and ‘bagging’ are both techniques to enhance the performance of an existing
learning algorithm by running it many times on a set of re-sampled data and combining
the output hypotheses to make a prediction. Bagging, due to Breiman [Bre94], is the
simpler of the two. It works by re-sampling from the input data with the same (uniform)
distribution, and its final hypothesis is obtained by taking a majority vote over the
predictions of the output hypotheses. Boosting^1 [Sch90, FS95] is a more complicated
method that can be used to boost the performance of a relatively weak learning algorithm
by use of sophisticated re-sampling on the training data. It does so by repeatedly
re-sampling from the input training data, with the sampling distribution varied each
time so as to focus more and more on the part of the training data on which the
previously obtained hypotheses did poorly. The final prediction of boosting is made by
taking a weighted majority (or average) of the predictions of all the hypotheses thus
obtained.

As noted earlier, one of the weaknesses of query by committee is that it cannot be
applied to a deterministic component algorithm. If the component learning algorithm we
have available is deterministic, the idea of bagging offers a natural alternative:
namely, apply bagging to obtain a set of hypotheses, let these hypotheses predict on a
set of candidate points, and pick the point on which the predictions have the largest
variance. When making a prediction, predict by majority vote over all the hypotheses.
Since query by bagging introduces randomness in the form of re-sampling from the input
data, it can be used on a component algorithm that is deterministic.

When the learning problem of interest is sufficiently complex, efficient implementation
of the Gibbs algorithm is not possible. If such is the case and the best known learning
algorithm does not have a very good performance, then it makes sense to use boosting to
enhance its performance. Recall that the most notable characteristic of boosting is its
tolerance of a poorly performing component learning algorithm. Thus, by appropriately
combining the idea of boosting and query by committee, we may obtain a query learning
method that is likewise tolerant of the performance of the component learning algorithm.

^1 Boosting was first discovered by Schapire [Sch90] in the context of proving the
equivalence of ‘weak learnability’ with strong PAC learnability. It was subsequently
improved by Freund [Fre90], and Freund and Schapire [FS95].

Recent experimentation using boosting has shown a remarkable fact (e.g. [DSS92]): even
after boosting has achieved perfect prediction on the training data, it keeps improving
its predictive performance on unseen data. This seemingly contradicts known facts about
over-learning, but recently Schapire et al. [SFBL97] have given an account of this fact.
That is, even after realizing perfect predictive performance on the training data,
boosting keeps increasing its confidence of prediction, or more specifically the
difference between the total weight assigned to the correct prediction and that assigned
to a wrong prediction. (This is called the ‘margin’ of the prediction.) In their paper,
they prove that a hypothesis having a larger margin on the training data performs better
on unseen data as well. Based on this observation, the method we propose here, query by
boosting, selects as the next query a point on which the margin obtained by the boosting
algorithm is minimum, and attempts to maximize the uncertainty of prediction and hence
the information gain on each query.
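As a sketch in symbols, and assuming the AdaBoost convention of [FS95] in which hypothesis $h_t$ votes with weight $\log(1/\beta_t)$, the margin of a labeled example and the selection rule for an unlabeled candidate set $C$ might be written as below. The names $w_t$, $\mathrm{margin}$ and $\mathrm{gap}$ are introduced here only for illustration, and the display omits the normalization by the total weight that is sometimes applied.

```latex
w_t = \log\frac{1}{\beta_t}, \qquad
\mathrm{margin}(x, y) = \sum_{t:\, h_t(x) = y} w_t \;-\; \sum_{t:\, h_t(x) \neq y} w_t ,
\qquad
\mathrm{gap}(x) = \Bigl|\, \sum_{t:\, h_t(x) = 0} w_t \;-\; \sum_{t:\, h_t(x) = 1} w_t \Bigr| ,
\qquad
x^{*} = \arg\min_{x \in C} \mathrm{gap}(x).
```

Because the label of a candidate point is not yet known, the selection rule uses the absolute difference between the two label weights; for a point labeled $y$ this equals $|\mathrm{margin}(x, y)|$.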

We conducted experiments using real world data to evaluate the performance of the
proposed query learning methods. In particular, we tested them on a large part of the
UCI Machine Learning data repository, using Quinlan’s C4.5 as the component algorithm.
Here we note that testing query learning algorithms on these databases is not possible
in a strict sense, since not all the query points can be answered. We therefore used our
query strategies as methods of selective sampling to pick more informative queries from
a fixed set of training data (cf. [LG94]). On almost all the data sets on which we
tested these learning methods, both query by boosting and query by bagging achieved a
significant increase in data efficiency as compared to both C4.5 and boosting applied on
C4.5. The increase in data efficiency, measured by the data size required by the query
learning methods to reach the same accuracy achieved by C4.5 (near the end of the data
set), was anywhere from 2 to 5-fold. As compared to boosting applied on C4.5, the
increase in data efficiency of the query methods was 2 to 4-fold on most data sets.

On one of the eight data sets above, tic-tac-toe, we ran analogous experiments using a
different component learning algorithm, a randomized version of a weighted majority
prediction algorithm for learning n-ary relations proposed in [ALN95] called WMPl. In
addition to the two query methods, we also tested the original query by committee
method, as the component algorithm is now randomized. It was found that, with randomized
WMPl as the component algorithm, both query by boosting and query by bagging performed
better than query by committee.

Algorithm: Query-by-Committee (QBC)
Notation: In general, we use $\bar{S}$ to denote the
unlabeled sample corresponding to $S$.

Input: Number of trials: $N$
Randomized component learning algorithm: $A$
Number of times $A$ is called: $T$
Number of query candidates: $R$
A set of query points: $Q$
Initialization: $S_1 = \langle (x_1, f(x_1)) \rangle$ for random $x_1$
For $i = 1, \ldots, N$

1. Run $A$ on $S_i$ $T$ times to obtain $h_1, \ldots, h_T$.

2. Randomly generate a set of $R$ points $C \subset Q \setminus \bar{S}_i$
with respect to the uniform distribution over $Q \setminus \bar{S}_i$.

3. Pick a point $x^* \in C$ split most evenly:
$x^* = \arg\min_{x \in C} \bigl|\, |\{t \le T : h_t(x) = 1\}| - |\{t \le T : h_t(x) = 0\}| \,\bigr|$

4. Query the function value at $x^*$ and obtain $f(x^*)$.

5. Update the past data as follows:
$S_{i+1} = \mathrm{append}(S_i, (x^*, f(x^*)))$

End For

Output: Output as the final hypothesis:
$h_{fin}(x) = \arg\max_{y \in Y} |\{t \le T : h_t(x) = y\}|$
where $h_t$ are hypotheses of the final ($N$-th) stage

Figure 1: Query by Committee (QBC)

2  Query  Learning  Methods 

2.1  Query  by  Committee 

We  briefly  describe  the  original  query  by  committee 
method,  generalized  to  use  an  arbitrary  randomized 
component  algorithm.  At  any  point  in  time,  query  by 
committee  runs  the  component  algorithm  on  the  past 
data  a  number  of  times  to  obtain  many  hypotheses. 
It  picks  the  next  query  point  by  choosing  from  among 
a  set  of  randomly  generated  candidate  points  a  point 
such  that  the  predictions  by  the  hypotheses  are  split 
most  evenly.  The  details  are  given  in  Figure  1.  Here, 
if  Q  is  a  pre-determined  set  of  points  on  which  the 
function  values  can  be  obtained,  then  the  algorithm 
as  described  is  a  method  of  selective  sampling.  If,  on 
the  other  hand,  Q  is  set  to  the  entire  domain,  then  it 
is  a  genuine  query  learning  algorithm,  which  is  free  to 
choose  any  point  in  the  domain  as  a  query  point. 
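One trial of the procedure in Figure 1 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the `(x, label)` tuple representation, and the toy `random_threshold_learner` (a randomized learner for one-dimensional threshold concepts) are all assumptions made for the example.

```python
import random


def vote_split(hyps, x):
    """Return | #votes for 1 - #votes for 0 | at point x (step 3 of Figure 1)."""
    ones = sum(1 for h in hyps if h(x) == 1)
    return abs(ones - (len(hyps) - ones))


def qbc_select(learner, sample, query_points, T, R, rng):
    """One QBC trial: return the chosen query point and the committee."""
    # Step 1: call the randomized learner T times on the same labeled sample.
    hyps = [learner(sample, rng) for _ in range(T)]
    # Step 2: draw up to R candidates uniformly from the still-unlabeled points.
    seen = {x for x, _ in sample}
    unlabeled = [x for x in query_points if x not in seen]
    candidates = rng.sample(unlabeled, min(R, len(unlabeled)))
    # Step 3: pick the candidate on which the committee is split most evenly.
    x_star = min(candidates, key=lambda x: vote_split(hyps, x))
    return x_star, hyps


def random_threshold_learner(sample, rng):
    """Toy randomized learner (hypothetical): draw a threshold consistent with the data."""
    lo = max((x for x, y in sample if y == 0), default=-1)
    hi = min((x for x, y in sample if y == 1), default=100)
    theta = rng.randint(lo + 1, hi)  # any theta in (lo, hi] labels the sample correctly
    return lambda x: 1 if x >= theta else 0


rng = random.Random(0)
sample = [(10, 0), (90, 1)]
x_star, committee = qbc_select(
    random_threshold_learner, sample, range(100), T=20, R=100, rng=rng
)
```

Passing a fixed pool as `query_points` gives the selective sampling variant; passing the whole domain gives the genuine query learning variant described above.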

2.2  Query  by  Bagging 

‘Bagging’ [Bre94] re-samples from the input sample with a fixed distribution, and the
final hypothesis is obtained by averaging the outputs of the hypotheses thus obtained.
This method is based on the idea that prediction error consists of the ‘bias,’ which is
the estimation error necessitated by the input data size, and the ‘variance,’ which is
due to the statistical variation existing in the specific data. The claim is that
bagging can isolate the two factors and can minimize the variance component of the
error. Query by bagging is like query by committee, except it applies bagging on the
input sample and picks as the next query point a point at which the predictions of the
hypotheses are most evenly split. The details of query by bagging are given in Figure 2.


Algorithm: Query-by-Bagging (QBag)

Input: Number of trials: $N$
Component learning algorithm: $A$
Number of times re-sampling is done: $T$
Number of query candidates: $R$
A set of query points: $Q$
Initialization: $S_1 = \langle (x_1, f(x_1)) \rangle$ for random $x_1$
For $i = 1, \ldots, N$

1. By resampling according to the uniform distribution
on $S_i$, obtain sub-samples $S'_1, \ldots, S'_T$ each of size $m$.

2. Run $A$ on each sub-sample and obtain $h_1, \ldots, h_T$.

3. Randomly generate a set of $R$ points $C \subset Q \setminus \bar{S}_i$
with respect to the uniform distribution over $Q \setminus \bar{S}_i$.

4. Pick a point $x^* \in C$ split most evenly:
$x^* = \arg\min_{x \in C} \bigl|\, |\{t \le T : h_t(x) = 1\}| - |\{t \le T : h_t(x) = 0\}| \,\bigr|$

5. Query the function value at $x^*$ and obtain $f(x^*)$.

6. Update the past data as follows:
$S_{i+1} = \mathrm{append}(S_i, (x^*, f(x^*)))$

End For

Output: Output as the final hypothesis:
$h_{fin}(x) = \arg\max_{y \in Y} |\{t \le T : h_t(x) = y\}|$
where $h_t$ are hypotheses of the final ($N$-th) stage

Figure 2: Query by Bagging (QBag)
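The bagging-based selection step can be sketched along the same lines. The code below is an illustrative sketch under stated assumptions, not the authors' code: `midpoint_stump` is a hypothetical deterministic learner for linearly separable one-dimensional data, and the helper names are invented for the example. The only source of disagreement among the hypotheses is the bootstrap re-sampling.

```python
import random
from collections import Counter


def bootstrap(sample, m, rng):
    """Draw m examples uniformly with replacement from the labeled sample."""
    return [rng.choice(sample) for _ in range(m)]


def vote_split(hyps, x):
    """Return | #votes for 1 - #votes for 0 | at point x."""
    ones = sum(1 for h in hyps if h(x) == 1)
    return abs(ones - (len(hyps) - ones))


def majority_vote(hyps, x):
    """Final hypothesis of bagging: the label predicted by most hypotheses."""
    return Counter(h(x) for h in hyps).most_common(1)[0][0]


def qbag_select(learner, sample, query_points, T, R, m, rng):
    """One query-by-bagging trial: return the chosen point and the hypotheses."""
    # Steps 1-2: run the deterministic learner on T bootstrap sub-samples.
    hyps = [learner(bootstrap(sample, m, rng)) for _ in range(T)]
    # Step 3: draw up to R candidates from the still-unlabeled points.
    seen = {x for x, _ in sample}
    unlabeled = [x for x in query_points if x not in seen]
    candidates = rng.sample(unlabeled, min(R, len(unlabeled)))
    # Step 4: pick the candidate whose votes are split most evenly.
    x_star = min(candidates, key=lambda x: vote_split(hyps, x))
    return x_star, hyps


def midpoint_stump(sample):
    """Toy deterministic learner (hypothetical): threshold halfway between the classes."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not ones:
        return lambda x: 0
    if not zeros:
        return lambda x: 1
    theta = (max(zeros) + min(ones)) / 2
    return lambda x: 1 if x >= theta else 0


rng = random.Random(0)
sample = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
x_star, hyps = qbag_select(midpoint_stump, sample, range(11), T=20, R=100, m=6, rng=rng)
```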

2.3  Query  by  Boosting 

We will now describe the query by boosting method in detail. In query by boosting, we
pick as the next query point a point at which the weighted voting of the final
hypothesis obtained by boosting the component learning algorithm has the least ‘margin.’
When the target function is 0,1-valued, this means that the query point is one for which
the difference between the total weight for the value 1 and that for 0 is minimum among
all candidate points. We give the details of this procedure in Figure 3, where we also
supply the details of AdaBoost [FS95] for completeness.
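The minimum-margin rule described above can be sketched with a small AdaBoost-by-resampling routine. This is a hedged illustration, not the authors' implementation. It assumes the standard [FS95] quantities (error $\epsilon_t$, $\beta_t = \epsilon_t / (1 - \epsilon_t)$, vote weight $\log(1/\beta_t)$). The clamp `EPS_FLOOR`, the choice to skip rounds with error at or above 1/2, and the toy `midpoint_stump` learner are assumptions added so that the example runs.

```python
import math
import random

EPS_FLOOR = 1e-10  # assumption: avoids log(1/0) when a hypothesis is perfect


def adaboost(sample, learner, T, rng):
    """AdaBoost by resampling; returns the hypotheses and their vote weights."""
    m = len(sample)
    dist = [1.0 / m] * m  # D_1 is uniform over the sample
    hyps, votes = [], []
    for _ in range(T):
        resampled = rng.choices(sample, weights=dist, k=m)
        h = learner(resampled)
        eps = sum(d for d, (x, y) in zip(dist, sample) if h(x) != y)
        if eps >= 0.5:  # weak-learning assumption violated; skip this round
            continue
        eps = max(eps, EPS_FLOOR)
        beta = eps / (1.0 - eps)
        hyps.append(h)
        votes.append(-math.log(beta))  # equals log(1 / beta)
        # Down-weight correctly classified examples, then renormalize.
        dist = [d * beta if h(x) == y else d for d, (x, y) in zip(dist, sample)]
        z = sum(dist)
        dist = [d / z for d in dist]
    return hyps, votes


def margin(hyps, votes, x):
    """| total weight voting 0 - total weight voting 1 | at point x."""
    w0 = sum(v for h, v in zip(hyps, votes) if h(x) == 0)
    w1 = sum(v for h, v in zip(hyps, votes) if h(x) == 1)
    return abs(w0 - w1)


def qboost_select(learner, sample, query_points, T, R, rng):
    """One query-by-boosting trial: pick the candidate with the smallest margin."""
    hyps, votes = adaboost(sample, learner, T, rng)
    seen = {x for x, _ in sample}
    unlabeled = [x for x in query_points if x not in seen]
    candidates = rng.sample(unlabeled, min(R, len(unlabeled)))
    x_star = min(candidates, key=lambda x: margin(hyps, votes, x))
    return x_star, hyps, votes


def midpoint_stump(sample):
    """Toy deterministic learner (hypothetical): threshold halfway between the classes."""
    zeros = [x for x, y in sample if y == 0]
    ones = [x for x, y in sample if y == 1]
    if not ones:
        return lambda x: 0
    if not zeros:
        return lambda x: 1
    theta = (max(zeros) + min(ones)) / 2
    return lambda x: 1 if x >= theta else 0


rng = random.Random(0)
sample = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
x_star, hyps, votes = qboost_select(midpoint_stump, sample, range(11), T=20, R=100, rng=rng)
```

Skipped rounds mean that fewer than `T` hypotheses may be returned; a different policy (for example, stopping early) would be an equally reasonable choice for a sketch like this.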

Note that the original query by committee, query by bagging, and query by boosting form
a natural progression. In query by committee, all the samples are identical, and the
variance of the component algorithm’s predictions is taken with respect to the
randomness that exists within the component algorithm. In query by bagging, subsamples
are obtained from the input sample using an identical distribution, and the variance of
the component algorithm’s predictions is with respect to the randomness in re-sampling.
In query by boosting, the re-sampling distribution itself is changed depending on the
properties of the obtained hypotheses, and the variance of the component algorithm’s
predictions is measured with respect to the uncertainty involved in weighted voting by
the various hypotheses.


Algorithm: Query-by-Boosting (QBoost)

Input: Number of trials: $N$
Component learning algorithm: $A$
Number of times re-sampling is done: $T$
Number of query candidates: $R$
A set of query points: $Q$

Initialization: $S_1 = \langle (x_1, f(x_1)) \rangle$ for random $x_1$
For $i = 1, \ldots, N$

1. Run AdaBoost on input $(S_i, A, T)$ and get:
$h_{fin}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log \frac{1}{\beta_t}$

2. Randomly generate a set of $R$ points $C \subset Q \setminus \bar{S}_i$
with respect to the uniform distribution over $Q \setminus \bar{S}_i$.

3. Pick a point $x^* \in C$ with the minimum margin:
$x^* = \arg\min_{x \in C} \Bigl|\, \sum_{t:\, h_t(x) = 0} \log \frac{1}{\beta_t} - \sum_{t:\, h_t(x) = 1} \log \frac{1}{\beta_t} \,\Bigr|$

4. Query the function value at $x^*$ and obtain $f(x^*)$.

5. Update the past data as follows:
$S_{i+1} = \mathrm{append}(S_i, (x^*, f(x^*)))$

End For

Output: Output $h_{fin}$ in the last stage as the output.


Subroutine: AdaBoost [FS95]

Input: Sample: $S = \langle (x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_m, y_m) \rangle$
(Here, assume $\forall y_i \in Y = \{0, 1\}$.)
Component learning algorithm: $A$
Number of times re-sampling is done: $T$
Initialization: $\forall i \le m,\ D_1(x_i) = 1/m$

For $t = 1, \ldots, T$

1. Run $A$ on a sample of size $m$ generated w.r.t. $D_t$.

2. Let its output hypothesis be $h_t$.

3. Compute its error rate $\epsilon_t$ by:
$\epsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(x_i)$

4. Calculate $\beta_t$ by $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$

5. Update the re-sampling distribution $D_{t+1}$:
$D_{t+1}(x_i) = \frac{D_t(x_i)\, \beta_t}{Z_t}$ if $h_t(x_i) = y_i$
$D_{t+1}(x_i) = \frac{D_t(x_i)}{Z_t}$ otherwise
(Here $Z_t$ is a normalization constant satisfying
$\sum_{i = 1, \ldots, m} D_{t+1}(x_i) = 1$.)

Output: Output as the final hypothesis:
$h_{fin}(x) = \arg\max_{y \in Y} \sum_{t:\, h_t(x) = y} \log \frac{1}{\beta_t}$

Figure 3: Query by boosting (QBoost)


name                       # ex.   # attributes     missing
                                   (disc.  cont.)   values
liver-disorders            345     6                -
ionosphere                 351     34               -
house-votes-84             435     16               ○
wdbc                       569     32               -
crx                        690     9  6             ○
breast-cancer-wisconsin    699     9                ○
pima-indians-diabetes      768     8                -
tic-tac-toe                958     9                -

Table 1: The eight data sets used in our experiments.

3  Experimental  procedures 

We evaluate the proposed query learning methods on the learning problem for concepts (or
0,1-valued functions) over a number of attributes, which are either binary, discrete or
numerical. A special case of this is when all the attributes are discrete, and the
target function can be regarded as an n-ary relation over n finite sets. In our
experiments, we use existing data sets for training and test data, without an explicitly
defined target function. Since it is not possible to use query learning algorithms
genuinely as query learners in this setting, we use them as methods for selective
sampling, that is, ways to select a smaller set of more effective data from a large data
set.

The data sets we used in our experiments were borrowed from the machine learning data
repository of the University of California at Irvine.^2 Of the large number of data sets
available from the repository, we selected 8 (not all) data sets satisfying the
following conditions: (1) the target function is 0,1-valued; (2) the data size is
moderate (more than 300 and less than 1,000). Table 1 summarizes the data sets we
selected and their basic characteristics.

^2 This data set, abbreviated as the UCI ML repository in what follows, is available at
URL address: “http://www.ics.uci.edu/~mlearn/MLRepository.html”

On these data sets, we compared the performance of C4.5, boosting applied on C4.5, query
by boosting applied on C4.5, and query by bagging applied on C4.5. For each data set, we
performed 10-fold cross validation, with one-tenth of the available data (selected
randomly) reserved as the test data and the rest used as the training data, or query
data. For each of the 10 pairs of training and test data sets, we averaged the results
over two randomized runs, a total of 20 runs for each data set.^3

The query learning algorithms are used to pick the next query point from the training
(query) data without replacement and are tested using the (separate) test data. When the
specified number of candidates exceeded what was left of the training data, we went on
with as many candidates as there were left. On one occasion, we also examined their
predictive performance on the query data, from which the query learners have selected a
subset to learn from, instead of using the separate test data.
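The selective sampling protocol just described can be sketched as a short loop. This is a minimal illustration under assumptions, not the authors' experimental code: `select` stands for any of the query strategies (it returns the index of the chosen candidate), `fit` stands for training the final hypothesis, and the toy `first_candidate` and `majority_fit` functions exist only to make the example runnable.

```python
import random


def selective_sampling(train, test, select, fit, n_trials, R, rng):
    """Pick query points from a fixed pool without replacement; track test accuracy."""
    pool = list(train)
    rng.shuffle(pool)
    labeled = [pool.pop()]  # S_1: a single randomly chosen labeled example
    curve = []
    for _ in range(n_trials):
        if not pool:
            break
        # Use as many candidates as are left once fewer than R remain.
        k = min(R, len(pool))
        cand_idx = rng.sample(range(len(pool)), k)
        best = select(labeled, [pool[i][0] for i in cand_idx])
        # Reveal the label of the chosen point and remove it from the pool.
        labeled.append(pool.pop(cand_idx[best]))
        h = fit(labeled)
        curve.append(sum(1 for x, y in test if h(x) == y) / len(test))
    return labeled, curve


def first_candidate(labeled, candidates):
    """Placeholder strategy (hypothetical): always take the first candidate."""
    return 0


def majority_fit(labeled):
    """Placeholder learner (hypothetical): predict the majority label everywhere."""
    ones = sum(y for _, y in labeled)
    label = 1 if 2 * ones >= len(labeled) else 0
    return lambda x: label


rng = random.Random(0)
train = [(x, int(x >= 5)) for x in range(10)]
test = [(x + 0.5, int(x + 0.5 >= 5)) for x in range(10)]
labeled, curve = selective_sampling(
    train, test, first_candidate, majority_fit, n_trials=20, R=4, rng=rng
)
```

In this toy run the pool holds only ten points, so the loop stops after nine queries even though twenty trials were requested.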

Finally,  the  parameters  T  and  R  in  all  the  query  learn¬ 
ing  methods  were  set  at  T  =  20  and  R  =  100  in  all  of 
our  experiments. 

4  Experimental  Results 

We now discuss the results of our experiments on the UCI Machine Learning Repository.
Figure 5 plots the learning curves obtained for the four learning methods on each of the
eight data sets. Each graph plots the predictive accuracy (in percentage) of the four
learning methods measured using the separate test data at every 50 trials. It is clearly
seen from these graphs that in all eight data sets, the two proposed query learning
methods achieve significant improvement in data efficiency as compared to C4.5. One can
see that at a very early stage in learning, say around 50 to 150 trials depending on the
data set, the prediction accuracy of the query learning methods reaches a level that is
achieved by C4.5 only towards the end of the data set. Table 2 gives concrete figures
that quantify this observation. Here, ‘the target error rate’ was calculated using the
error rate of C4.5 in the last 100 trials.^4 Then, we checked to see how many trials it
took for all four methods to reach that error rate. In parentheses, we also exhibit the
ratio of the number of trials required by each of the methods to that of C4.5. One can
see that typically the data efficiency is improved by a factor of 2 to 4.
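The ratio reported in parentheses can be computed from a learning curve as sketched below. The helper names and the example curves are hypothetical and are not taken from the paper; the sketch only shows the arithmetic of "trials needed by a method divided by trials needed by the baseline".

```python
def trials_to_reach(curve, target_error):
    """Return the first trial count whose error rate is <= target_error, else None."""
    for n_trials, error in curve:
        if error <= target_error:
            return n_trials
    return None


def efficiency_ratio(method_curve, baseline_curve, target_error):
    """Trials needed by the method divided by trials needed by the baseline."""
    n_method = trials_to_reach(method_curve, target_error)
    n_base = trials_to_reach(baseline_curve, target_error)
    if n_method is None or n_base is None:
        return None  # the target error rate was never reached
    return n_method / n_base


# Hypothetical learning curves as (number of trials, error rate) pairs.
baseline = [(50, 0.30), (100, 0.20), (150, 0.15), (200, 0.10)]
query_method = [(50, 0.12), (100, 0.10), (150, 0.08)]
ratio = efficiency_ratio(query_method, baseline, target_error=0.10)
```

As a check against the reported numbers, the ionosphere row of Table 2 lists 91 trials for query by bagging and 236 for C4.5, and 91/236 rounds to the 0.39 shown in parentheses.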

The speed-up achieved by the two query learning methods compared against boosting
applied on C4.5 is less dramatic but still significant. From the graphs, one can see
that on five of the eight data sets, namely breast-cancer-wisconsin, tic-tac-toe,
ionosphere, house-votes-84 and wdbc, the advantage of the query methods over boosting is
clear, while on the other three it is less obvious. These three data sets, crx,
liver-disorders, and pima-indians-diabetes, appear to have a common feature: a certain
level of accuracy is achieved with relatively few examples, but from then on the
accuracy hardly improves as the data size increases. It may be that the target function
of these data sets is sufficiently noisy that no learning method can break this barrier.
The increase in data efficiency achieved by the query learning methods in comparison to
boosting is summarized in Table 3, in the same manner as before.

^3 The results involving WMPl were obtained by averaging over 10 runs, not 20 runs.

^4 For this calculation, we fed a randomly chosen test example after each trial, and the
prediction error of the current trial was calculated by the average prediction error
over the last 50 test trials.

All the evaluation discussed thus far has been based on the prediction accuracy measured
using test data, which are disjoint from the training data or the query data from which
the query learning methods selected query points. As we remarked earlier, this is
selective sampling and not genuine query learning. If we measure the prediction accuracy
of query learning algorithms with respect to the query data, then this would translate
to a genuine query learning scenario, except that the function being learned is defined
solely by the query data, and only on those points that are in the data. We took this
viewpoint and examined the learning curves for the four methods with respect to this
measure. Figure 6 plots these learning curves for the eight data sets as before. One can
more clearly see the effect of query learning here: with respect to all but one data set
(pima-indians-diabetes), the accuracy of the two query learning methods rises much
faster than that of either C4.5 or boosting on C4.5, typically achieving an increase in
data efficiency by a factor of 3 to 6.

On one of the eight data sets, tic-tac-toe, we ran experiments analogous to the above
using a randomized version of WMPl as the component learning algorithm. Figure 4 plots
the prediction accuracy achieved by each of the five methods at the end of every 50
trials. Note that query by committee can now be applied because we use a randomized
component algorithm. Here much of the tendency observed using C4.5 carries over. Notice,
however, that here the two proposed methods, query by boosting and query by bagging,
out-perform query by committee. Also, in this case query by boosting seems to do better
than query by bagging, at least for a wide range of data sizes. The relative performance
of the competing query learning methods appears to depend on the component learning
algorithm (and the learning problem). Note further that boosting and the query methods
applied on WMPl achieve much higher accuracy than those applied on C4.5 on this
particular problem. Interestingly, WMPl itself does not have a higher accuracy than
C4.5, but both boosting and query by boosting applied on WMPl are significantly more
effective than those applied on C4.5. This observation suggests that on component
algorithms and problems on which boosting is effective, query by boosting may do better
than
name                     query by     query by     boosting     C4.5         total  target error
                         bagging      boosting                               size   rate (C4.5)

liver-disorders          [illegible]  [illegible]  [illegible]  [illegible]  310    0.3685
ionosphere               91(0.39)     97(0.41)     143(0.61)    236(1.0)     315    0.0935
house-votes-84           65(0.21)     72(0.24)     145(0.48)    303(1.0)     391    0.0465
wdbc                     82(0.26)     88(0.28)     208(0.66)    314(1.0)     512    0.054
crx                      64(0.50)     100(0.79)    119(0.94)    127(1.0)     621    0.171
breast-cancer-wisconsin  86(0.40)     83(0.39)     209(0.98)    213(1.0)     629    0.072
pima-indians-diabetes    67(0.44)     63(0.41)     81(0.53)     152(1.0)     691    0.2895
tic-tac-toe              236(0.39)    243(0.40)    308(0.51)    609(1.0)     862    0.1445

Table  2:  Data  efficiency  increase  achieved  with  respect  to  C4.5 


name                     query by     query by     boosting     C4.5         total  target error
                         bagging      boosting                               size   rate (boosting)

liver-disorders          111(0.86)    126(0.98)    129(1.0)     -            310    0.3305
ionosphere               121(0.50)    119(0.49)    243(1.0)     -            315    0.073
house-votes-84           71(0.34)     136(0.65)    210(1.0)     366(1.74)    391    0.04
wdbc                     97(0.32)     130(0.43)    300(1.0)     506(1.69)    512    0.0455
crx                      86(0.60)     140(0.97)    144(1.0)     -            621    0.146
breast-cancer-wisconsin  103(0.34)    92(0.31)     301(1.0)     391(1.30)    629    0.0495
pima-indians-diabetes    99(0.56)     191(1.09)    176(1.0)     -            691    0.2475
tic-tac-toe              438(0.52)    517(0.62)    836(1.0)     -            862    0.053

Table  3:  Data  efficiency  increase  achieved  with  respect  to  boosting 
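
The parenthesized figures in Tables 2 and 3 appear to be the number of training examples a method needs to reach the target error rate, divided by the number the baseline needs (C4.5 in Table 2, boosting in Table 3). The following is a minimal Python sketch of that arithmetic on the ionosphere rows; the function name `efficiency_ratio` is ours and is used for illustration only.

```python
def efficiency_ratio(examples_needed, baseline_needed):
    """Fraction of the baseline's training data needed to reach the same target error."""
    return round(examples_needed / baseline_needed, 2)

# ionosphere row of Table 2 (baseline: C4.5 needs 236 examples)
table2_ionosphere = {"query by bagging": 91, "query by boosting": 97, "boosting": 143}
ratios_vs_c45 = {m: efficiency_ratio(n, 236) for m, n in table2_ionosphere.items()}

# ionosphere row of Table 3 (baseline: boosting needs 243 examples)
table3_ionosphere = {"query by bagging": 121, "query by boosting": 119}
ratios_vs_boosting = {m: efficiency_ratio(n, 243) for m, n in table3_ionosphere.items()}

print(ratios_vs_c45)
print(ratios_vs_boosting)
```

The reciprocal of each ratio gives the corresponding data-efficiency factor; for instance, 236/91 is roughly 2.6 for query by bagging against C4.5 on ionosphere.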


Figure 4: Prediction accuracy on test data on tic-tac-toe. Left: Using WMPl as the component algorithm.
Right: Using C4.5 as the component algorithm.


Query  Learning  Strategies  using  Boosting  and  Bagging  7 


the  other  query  learning  methods  as  well. 

The time complexity of all three query learning methods
we considered is of the order $O(NTR \cdot F(N))$,
where $F(N)$ is the time complexity of the component
algorithm when run on an input sample of size $N$.
This is a tractable but significant increase in computation
cost compared with the component algorithm alone.
Whether the data efficiency gained by these methods
justifies the additional computational burden depends
on the application at hand. Also note that both query
by committee and query by bagging are parallelizable
with respect to $T$ and $R$, but query by boosting is
parallelizable only with respect to $R$, not $T$. Thus,
query by boosting would be the method of choice only
when it buys significantly more data efficiency.
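
To make the roles of the three parameters concrete, here is a minimal Python sketch of a query-by-bagging loop. It assumes that N is the number of query trials, T the number of bootstrap resamples (committee members) per trial, and R the number of candidate query points per trial. The threshold learner `train_stump`, the one-dimensional input space, and the oracle are hypothetical stand-ins chosen for brevity; they are not the component algorithms used in the experiments.

```python
import random

def train_stump(sample):
    """Hypothetical component learner: best 1-D threshold rule 'x >= thr'."""
    best_thr, best_err = None, None
    for thr, _ in sample:
        err = sum((x >= thr) != y for x, y in sample)
        if best_err is None or err < best_err:
            best_thr, best_err = thr, err
    return lambda x, thr=best_thr: x >= thr

def bagged_committee(data, learner, t_resamples, rng):
    """Train T hypotheses, each on a bootstrap resample of the labeled data."""
    return [learner([rng.choice(data) for _ in data]) for _ in range(t_resamples)]

def query_by_bagging(oracle, learner, initial, n_trials, t_resamples, r_candidates, rng):
    data = list(initial)
    learner_calls = 0
    for _ in range(n_trials):
        committee = bagged_committee(data, learner, t_resamples, rng)
        learner_calls += t_resamples
        candidates = [rng.random() for _ in range(r_candidates)]

        def margin(x):
            yes = sum(h(x) for h in committee)
            return abs(2 * yes - len(committee))   # |#yes votes - #no votes|

        x_star = min(candidates, key=margin)       # most contested candidate
        data.append((x_star, oracle(x_star)))      # query the oracle for its label
    return data, learner_calls

rng = random.Random(0)
oracle = lambda x: x >= 0.5                        # hypothetical target concept
initial = [(0.1, False), (0.9, True)]
data, calls = query_by_bagging(oracle, train_stump, initial,
                               n_trials=20, t_resamples=5, r_candidates=10, rng=rng)
print(len(data), calls)
```

In this sketch the T committee members of a trial are trained independently of one another, which is what makes bagging parallelizable over T; in boosting, each round depends on the weights produced by the previous round.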

5  Concluding  Remarks 

We  proposed  two  variants  of  query  by  committee  that 
can  be  applied  on  an  arbitrary  component  algorithm, 
be  it  deterministic  or  randomized,  by  incorporating 
the  ideas  of  boosting  and  bagging.  Experiments  on 
data  sets  from  the  UCI  Machine  Learning  repository 
demonstrated that, when used with C4.5 as
the component algorithm, the proposed query learning
methods achieve a significant increase in data
efficiency compared with both C4.5 and boosting applied
to C4.5. On one of the data sets which can be cast
as  an  n-ary  learning  problem,  we  tested  these  methods 
using  a  randomized  weighted  majority  prediction  algo¬ 
rithm  for  n-ary  relations  as  the  component  algorithm, 
and  found  that  the  proposed  methods  performed  bet¬ 
ter  than  query  by  committee.  In  the  near  future,  we 
plan  to  carry  out  more  systematic  evaluation  to  verify 
the  robustness  of  the  proposed  query  methods  on  the 
choice  of  the  component  algorithm  and  the  learning 
problem. 
Figure 5: Learning curves for four learning methods on the UCI ML Repository.


[Figure 6 panels, one per data set: liver-disorders, house-votes-84, crx, pima-indians-diabetes, ionosphere, wdbc, breast-cancer-wisconsin, tic-tac-toe. x-axis: # training data; y-axis: accuracy (%). Curves: Q by boosting, Q by bagging, boosting, C4.5.]

Figure 6: Learning curves on 'query data' for four learning methods on the UCI ML repository.
Abstract

Genetic  Programming  (GP)  is  a  machine 
learning  technique  that  was  not  conceived 
to  use  domain  knowledge  for  generating  new 
candidate  solutions.  It  has  been  shown  that 
GP  can  benefit  from  domain  knowledge  ob¬ 
tained  by  other  machine  learning  methods 
with  more  powerful  heuristics.  However,  it 
is  not  obvious  that  a  combination  of  GP 
and  a  knowledge  intensive  machine  learning 
method  can  work  better  than  the  knowledge 
intensive  method  alone.  In  this  paper  we 
present  a  multi-strategy  approach  where  an 
analytical and inductive approach (HAMLET)
and  an  evolutionary  technique  based  on  GP 
(EvoCK)  are  combined  for  the  task  of  learn¬ 
ing  control  rules  for  problem  solving  in  plan¬ 
ning.  Results  show  that  both  methods  com¬ 
plement  each  other,  supplying  to  the  other 
method  what  the  other  method  lacks  and  ob¬ 
taining  better  results  than  using  each  method 
alone. 


1  INTRODUCTION 

Genetic  Programming  (GP)  is  a  machine  learning 
technique  based  on  a  search  over  a  huge  state 
space  [Koza  and  Rice,  1991].  Therefore,  as  any  search 
method,  it  can  be  defined  in  terms  of  three  elements: 
an  initial  state,  a  set  of  operators,  and  a  heuristic  func¬ 
tion  (called  fitness  function).  GP  expands  the  ideas 
of  Genetic  Algorithms  by  using  structured  representa¬ 
tions  (trees).  The  use  of  this  type  of  representation 
is  more  appropriate  for  solving  symbolic  tasks  than 
Genetic  Algorithms. 


One such task is learning control knowledge
for problem solving. Problem solving can also be
described in terms of a search, in a state space different
from that of GP. Traditional approaches use
domain-independent planners for generating plans [Blum and
Furst, 1995, Penberthy and Weld, 1992]. PRODIGY,
an architecture for planning and learning that uses a
means-ends analysis nonlinear planner, is one such
system [Veloso et al., 1995]. However, planning becomes
impractical for large problems. To gain efficiency,
PRODIGY must be supplied with domain-dependent
search control knowledge that can be applied
at decision points in the planning reasoning cycle.
This control knowledge takes the form of control rules,
as explained later.

In this type of task, the use of all available domain
knowledge is essential for an efficient learning process.
Classically, GP systems have used domain knowledge
only in the fitness function. We propose also using
background knowledge, obtained from a previous
learning technique, in the other two search
elements [Aler et al., 1998a]. First, the initial state will
not be created randomly, but from control knowledge
learned by another method, HAMLET in this case [Borrajo
and Veloso, 1997]. Second, genetic operators will
use knowledge in the form of examples, obtained as a
by-product of HAMLET's learning process.

In  [Aler  et  al,  1998a]  we  have  shown  that  GP  ob¬ 
tains  much  better  results  in  planning  by  using  such 
background  knowledge.  The  purpose  of  this  paper  is 
to  show  that  a  multi-strategy  approach  using  GP  and 
HAMLET  works  better  than  using  each  method  alone. 
This  multi-strategy  approach  can  be  seen  as  a  com¬ 
bination  of  learning  bias  from  different  methods:  GP 
and  HAMLET.  In  this  paper,  we  have  used  PRODIGY, 
but  in  the  future  other  planners  such  as  UCPOP  or 
Graphplan  might  be  used. 
Section 2 explains the role of learning in planning.
Section 3 describes our multi-strategy approach for learning
in planning. Section 4 describes our experimental
setup and the results obtained. Section 5 discusses
these results, and presents the conclusions. Finally,
Section 6 surveys related work.

2  THE  LEARNING  TASK 

The  learning  task  can  be  stated  as:  given  a  set  of  traces 
belonging  to  problems  solved  by  PRODIGY  in  a  particu¬ 
lar  planning  domain,  induce  a  set  of  control  rules  that 
perform  well  in  that  planning  domain.  Control  rules 
help  PRODIGY  to  make  decisions  at  several  points  in 
its  search  process.  If  there  are  no  applicable  control 
rules  in  a  decision  point,  PRODIGY  will  make  a  default 
decision.  It  has  five  kinds  of  decision  points:^ 

•  Select,  prefer  or  reject  a  goal  from  the  set  of  pend¬ 
ing  goals. 

•  Select,  prefer  or  reject  an  operator  to  achieve  a 
goal. 

•  Select,  prefer  or  reject  a  binding  for  the  chosen 
operator. 

•  Choose  whether  to  apply  an  instantiated  applica¬ 
ble  operator  or  to  subgoal  on  an  unachieved  goal. 

•  Select,  prefer  or  reject  an  instantiated  operator 
from  the  set  of  applicable  instantiated  operators. 

Figure 1 shows an example of a control rule
for the blocksworld domain; current-goal and
true-in-state are meta-predicates. The control rule
says that if PRODIGY is trying to hold an
object, <object1>, and this object is on top of another,
<object2>, in the current state, then PRODIGY
should select the operator UNSTACK and reject the rest
of the operators that could achieve the same goal.

(control-rule select-operators-unstack
  (if (and (current-goal (holding <object1>))
           (true-in-state (on <object1> <object2>))))
  (then select operator unstack))


Figure  1:  Example  of  a  control  rule  for  making  the 
decision  of  what  operator  to  use. 
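
The rule in Figure 1 can be read as a test on a meta-state. The Python sketch below illustrates that reading under simplifying assumptions: a meta-state is modelled as a dictionary holding the current goal and the set of ground facts of the planning state, and literals are flat tuples. The names `unify` and `select_operators_unstack` and the dictionary layout are ours; this is not how PRODIGY represents or matches rules internally.

```python
def unify(pattern, fact, bindings):
    """Match a flat literal such as ('on', '<object1>', '<object2>') against a
    ground fact; return the extended bindings, or None if they clash."""
    if len(pattern) != len(fact):
        return None
    extended = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("<") and p.endswith(">"):   # p is a variable
            if extended.setdefault(p, f) != f:
                return None
        elif p != f:                                # constants must agree
            return None
    return extended

def select_operators_unstack(meta_state):
    """Return 'unstack' when the Figure 1 rule fires, else None."""
    bindings = unify(("holding", "<object1>"), meta_state["current_goal"], {})
    if bindings is None:
        return None
    for fact in meta_state["state"]:
        if unify(("on", "<object1>", "<object2>"), fact, bindings) is not None:
            return "unstack"
    return None

meta_state = {
    "current_goal": ("holding", "A"),
    "state": {("on", "A", "B"), ("on-table", "B"), ("clear", "A")},
}
print(select_operators_unstack(meta_state))
```

With the goal changed to holding B, no fact of the form (on B ...) exists, so the rule does not fire and the planner would fall back to its default decision.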

^HAMLET  only  generates  selection  control  rules.  In  this 
article,  GP  will  look  just  for  that  kind  of  control  rules,  so 
that it can be properly compared with HAMLET.


At every decision point, PRODIGY is in a particular
search meta-state. Let $ME$ be the set of all possible
meta-states. Helping PRODIGY to make decisions
can then be stated as: for each possible decision
(for example, select goal (on x y)), find a partition
of $ME$ into $ME^{+}$ (where the decision should be
taken) and $ME^{-}$ (where the decision should not be
taken). That is, control rules are actually classification
rules: they partition the space of meta-states into
those meta-states that belong to a possible decision
and those that do not. This resembles traditional
machine learning concept induction, where classification
rules have to be induced from a set of examples.
In this case, the task has the following characteristics:

•  Several target concepts have to be learnt from the
same data (set of traces). Not only are there different
kinds of target concepts associated with each
kind of decision (select operator, select goal, etc.),
but each kind of decision also has several associated
target concepts. For instance, there will be one
target concept of the type select operator for each
possible (operator, goal) pair of a particular domain.

•  Target  concepts  will  generally  be  disjunctive  (that 
means  that  several  control  rules  will  be  needed  to 
represent  a  target  concept). 

•  The  representation  of  concepts  is  relational,  so  we 
are  dealing  with  an  ILP  problem. 

Therefore,  when  using  GP,  each  individual  will  be  a  set 
of  control  rules,  represented  as  a  structure  that  will  be 
explained  in  Section  3.2.  A  GP  population  is  made  of 
several  such  individuals. 

3  A MULTI-STRATEGY APPROACH FOR LEARNING
CONTROL KNOWLEDGE

In  this  section  we  will  describe  the  architecture  of  the 
learning  system,  and  define  the  learning  behavior  in 
terms  of  its  three  learning  biases. 

3.1  ARCHITECTURE  OF  THE 
LEARNING  SYSTEM 

The  general  architecture  of  our  system  consists  of  five 
blocks  (as  also  shown  in  Figure  2).  The  main  blocks 
are  EvoCK  (“Evolution  of  Control  Knowledge”)  and 
HAMLET. 
Figure 2: General Architecture of the multi-strategy
approach.

EvoCK is the module that implements the GP
paradigm adapted for evolving planning control rules.
EvoCK is supplied with fitness cases, which are planning
problems generated at random by a problem
generator. To evaluate an individual from the population
on the set of fitness cases, EvoCK tells PRODIGY
to load the individual and try to solve each of
the fitness cases. The performance of the individual on
these fitness cases is returned to EvoCK. HAMLET has
a similar relation with PRODIGY and the problem generator,
but in this case the information returned by
PRODIGY is the search tree that HAMLET will use to
generalize and refine its control rules.

EvoCK and HAMLET are weakly coupled in the following
way. First, HAMLET is run to learn from a set
of randomly generated problems. Then, two of its outputs
are used as background knowledge for EvoCK.
The set of rules learned by HAMLET (“HAMLET individual”)
is used to seed the EvoCK initial population.
In addition, HAMLET supplies a set of positive
examples (“Background Knowledge Population”) that
will be taken as input by one of the genetic operators
(knowledge-based crossover [Aler et al., 1998a]). This
will be explained in subsection 3.3.

When  EvoCK  gets  to  the  maximum  number  of  evalu¬ 
ations  allowed  for  learning,  it  returns  its  best  individ¬ 
ual  obtained  so  far.  Although  not  shown  in  Figure  2, 
best  individuals  are  tested  with  a  different  set  of  plan¬ 
ning  problems  (also  obtained  from  the  problem  gener¬ 
ator)  to  check  how  well  they  have  generalized  from  the 
training  data. 

In the next three sections, we describe the system by
explaining its learning biases. Following Utgoff
[Utgoff, 1986], these biases are classified into language
biases, exploration biases and evaluation biases.


3.2  THE  LANGUAGE  BIAS 

Usually, in GP there are no constraints on the structure
that is to evolve; any combination of functions
and terminals is valid, and crossover points can
be taken at any place in the individual. In our case,
however, PRODIGY restricts which structures are valid.
For instance, a meta-predicate
like TRUE-IN-STATE^ can only take as its argument
a goal like (on <x> <y>), not an operator like
PUT-DOWN. Other general constraints are imposed by the
structure of the rule language itself (if <condition>
then <action>, etc.). In many cases this problem can
be solved by achieving operational closure, that is,
by allowing each function to accept any type of result
[Koza and Rice, 1991]. However, this is not possible
in this case, since PRODIGY fixes the structure of
the language for representing control rules, and feeding
it invalid control rules would make it fail.

Therefore, we have chosen to constrain structures to
PRODIGY-valid ones (in the literature, such structures
are called “constrained structures” [Koza and Rice,
1991] or “strongly typed structures” [Montana, 1995]).
To achieve this, three conditions must
be met: only valid structures are created, crossover
points are of the same type, and mutation operators
take into account the type of the mutation
point. The first condition is met by using a
special-purpose production grammar. An example of
an individual generated by the grammar is the
one that appears in Figure 3. This individual consists
of two control rules for the blocksworld domain. The
first one checks whether there is a block with no other
blocks on it and whether the planner is trying to achieve
either putting that object on another object or having
the robot arm hold a third, different object. If both
conditions hold, then the planner will work next
on the (on <object-1> <object-2>) goal. The other
control rule says that if there is an object on the table
and the system is trying to bind the pick-up operator,
then it should be bound to that object.

3.3  THE  EXPLORATION  BIAS 

The  exploration  bias  includes  everything  related  to  the 
search  policy:  search  operators,  background  knowl¬ 
edge  to  constrain  the  search,  etc.  The  system  uses 

*Meta-predicates  are  functions  that  have  access  to 
PRODIGY  meta-state.  Therefore  they  can  check  whether 
a  condition  is  true  or  not  in  the  meta-state.  For  instance 
TRUE-IN-STATE  tests  if  a  particular  condition  is  true  in  the 
current  planning  state 
(list (rule (and (true-in-state (clear <object-1>))
                 (some-candidate-goals
                   (goals-list (on <object-1> <object-2>))
                   (holding <object-3>)))
            (select-goal (on <object-1> <object-2>)))

      (rule (true-in-state (on-table <object-1>))
            (select-bindings (pick-up-b <object-1>))))

Figure  3:  Example  of  EvoCK  individual. 

the  traditional  GP  operators  (crossover  and  mutation) 
and  some  others  specially  tailored  for  the  learning  task. 
The  whole  operator  set  is: 

•  Copy: reproduction without modification.

•  Xover: traditional crossover. It takes two constrained
structures and produces one constrained
structure.

•  Changing_mutation: it chooses a mutation
point and replaces the whole subtree with another,
randomly generated subtree. This mutation is
equivalent to Xover with a randomly generated
individual (as the second parent).

•  Xover_add: some points in the evolving structure
allow for lists of elements of the same kind
(for instance, lists of goals). In those cases,
crossover adds elements to the lists from the other
parent, instead of replacing the whole list.

•  Chopping_off_mutation: at those points where
lists of elements of the same kind are allowed, it
removes one of the elements.

•  Growing_mutation: it adds a random subtree
at those points where lists of elements of the same
type are allowed. It is equivalent to Xover_add
with a randomly generated individual (as the second
parent).

All  these  operators  are  simple  variations  of  genetic  op¬ 
erators  traditionally  used  in  GP.  The  next  two  opera¬ 
tors  are  specially  tailored  for  this  learning  task. 

•  Join: it selects one variable in the control rule
(like <object-1>) and replaces it with any other
variable in the control rule. The rationale behind
this operator is that sometimes there are conditions
in a rule that are not related to other
conditions by common variables, which is often
undesirable. For instance, if we have a control
rule to pick up an object <obj1> when some conditions
are true, our experience says that many
of those conditions should refer to <obj1>. The
join operator is a simple way of creating these
references.

•  Up_the_hierarchy: objects (the elements to
which the planning operators are applied) in
PRODIGY are organized in a tree-shaped type hierarchy.
For instance, in the logistics transportation
planning domain, there are trucks and planes,
which are both defined as carriers. This genetic
operator takes a truck-typed variable in the
left-hand side of the rule and replaces all
its instances with a carrier-typed variable. Thus,
the control rule becomes more general.

The  related  specialization  operators  (i.e.  disjoin  and 
down_the_hierarchy)  are  not  included  in  the  operator 
pool;  we  are  imposing  a  strong  bias  towards  general¬ 
ization.  However,  the  system  can  still  specialize  by 
means  of  the  other  generic  operators  (mutation,  etc). 
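
The two generalization operators can be sketched in a few lines of Python. The sketch assumes that a rule is a nested tuple whose variables are strings of the form `<name>`, and that variable types are kept in a separate dictionary; both representations, the `TYPE_PARENT` table, and the function names are hypothetical and chosen only to illustrate the idea, not to mirror EvoCK's data structures.

```python
import random

def variables(term):
    """Collect variable names (strings like '<x>') in a nested-tuple term."""
    if isinstance(term, tuple):
        found = set()
        for sub in term:
            found |= variables(sub)
        return found
    return {term} if term.startswith("<") and term.endswith(">") else set()

def substitute(term, old, new):
    """Replace every occurrence of variable `old` by `new`."""
    if isinstance(term, tuple):
        return tuple(substitute(sub, old, new) for sub in term)
    return new if term == old else term

def join(rule, rng):
    """Join: replace one variable of the rule by another variable of the rule."""
    names = sorted(variables(rule))
    if len(names) < 2:
        return rule
    old = rng.choice(names)
    new = rng.choice([v for v in names if v != old])
    return substitute(rule, old, new)

# Hypothetical fragment of the logistics type hierarchy.
TYPE_PARENT = {"truck": "carrier", "plane": "carrier"}

def up_the_hierarchy(var_types, var):
    """Up_the_hierarchy: retype `var` with the parent of its current type."""
    parent = TYPE_PARENT.get(var_types[var])
    if parent is None:          # already at the top of the known hierarchy
        return dict(var_types)
    return {**var_types, var: parent}

rule = ("rule",
        ("and", ("true-in-state", ("on-table", "<x>")),
                ("true-in-state", ("clear", "<y>"))),
        ("select-bindings", ("pick-up-b", "<x>")))
joined = join(rule, random.Random(0))
print(sorted(variables(joined)))
print(up_the_hierarchy({"<t>": "truck"}, "<t>"))
```

After `join`, the two conditions share a single variable, which is the kind of cross-reference between conditions that the operator is meant to create.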

Background  knowledge  can  be  introduced  to  the  sys¬ 
tem  in  order  to  restrict  the  search.  So  far,  we  have 
used  two  kinds  of  background  knowledge: 

•  Seeding the initial population with an individual
coming from HAMLET.

•  The early phase of HAMLET returns a set of positive
and negative examples as a by-product. Positive
examples are those where PRODIGY made the
right decision in the planning process. These positive
examples can be easily transformed into control
rules and then into GP individuals. The
crossover operator is then able to draw individuals
from the background knowledge population
instead of the evolving population (this is what
we have called the “knowledge-based crossover operator”
[Aler et al., 1998a]). In that way, background
knowledge can be injected into the evolving population.

Finally,  we  use  a  steady  state  GP  with  a  generational 
gap  of  1.  2-tournaments  are  held  for  both  selection  and 
replacement.  This  has  been  shown  experimentally  to 
behave  well. 

3.4  THE  EVALUATION  BIAS 

The  evaluation  bias  concerns  the  preference  criteria 
used  by  GP  for  selecting  an  individual  over  another, 
which  is  coded  as  a  fitness  function.  In  our  case,  we 
devised  a  hierarchical  fitness  function  that  contains  the 
following components [Aler et al., 1998b, Aler et al.,
1998a]:

1.  Performance in fitness cases: to maximize.
How well the individual performs when PRODIGY
tries to solve the training planning problems while
guided by the individual (acting as a set of control
rules). It will be explained later in more detail.

2.  Number of different variables: to minimize.
This fitness component is related to the same bias
as the join operator. We want as many
meta-predicates as possible in the left-hand side of the
control rules to be inter-related by common variables.

3.  Number of different true-in-state meta-predicates:
to minimize. The fewer true-in-state
meta-predicates, the more general the set of control
rules and the faster it runs.

4.  Number of different goals in some-candidate-goals
meta-predicates: to maximize.
This meta-predicate returns true if at least
one of its arguments is a candidate goal to be
solved by the planner. So, the more goals
some-candidate-goals has, the more cases it will be
applicable in and the more general it will be (although
less compact).

5.  Number of different some-candidate-goals:
to maximize. Another way of making a rule
more general is to get rid of unnecessary
some-candidate-goals checks. This also makes it
faster.

6.  Number of control rules: to minimize. The
fewer control rules, the faster the individual
will solve the problems.

7.  Individual size (in nodes): to minimize.

All  individuals  in  the  tournament  set  that  have  the 
same  score  in  the  first  comparison  will  pass  to  the  sec¬ 
ond  one  and  so  on.  The  rest  will  be  dropped  off  the 
tournament.  If  more  than  one  individual  happen  to 
pass  the  last  comparison,  the  tournament  winner  is 
chosen  randomly. 
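
One way to read this hierarchical comparison is as successive filtering: at each criterion only the contestants tied for the best score survive, and a random choice breaks any tie left after the last criterion. The Python sketch below follows that reading, which is our interpretation of the paragraph above. The dictionary fields and the shortened criteria list are hypothetical.

```python
import random

def tournament_winner(contestants, criteria, rng):
    """criteria: ordered list of (score_function, maximize) pairs."""
    survivors = list(contestants)
    for score, maximize in criteria:
        scores = [score(ind) for ind in survivors]
        best = max(scores) if maximize else min(scores)
        # Only contestants tied for the best score pass to the next comparison.
        survivors = [ind for ind, s in zip(survivors, scores) if s == best]
        if len(survivors) == 1:
            break
    return rng.choice(survivors)   # random choice among remaining ties

# Shortened, illustrative criteria list (the text above uses seven).
criteria = [
    (lambda ind: ind["performance"], True),
    (lambda ind: ind["variables"], False),
    (lambda ind: ind["rules"], False),
    (lambda ind: ind["size"], False),
]
a = {"name": "a", "performance": 10, "variables": 3, "rules": 4, "size": 40}
b = {"name": "b", "performance": 10, "variables": 3, "rules": 2, "size": 55}
c = {"name": "c", "performance": 9, "variables": 1, "rules": 1, "size": 10}

rng = random.Random(0)
print(tournament_winner([a, b], criteria, rng)["name"])
print(tournament_winner([a, c], criteria, rng)["name"])
```

In the first tournament the contestants tie on performance and on variables, so the number of rules decides; in the second, performance alone decides, and the smaller size of `c` is never consulted.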

The first criterion, performance in fitness cases, was
formerly computed by measuring how many steps of
the solution of a given planning problem the individual
managed to follow (solutions to all the planning
problems were known by EvoCK in advance, by letting
PRODIGY solve those problems and storing the
search trees). However, although we obtained good results,
we realized that an individual managing to follow
many steps in the solution did not guarantee that the
individual would actually solve the problem. Therefore,
we have decided to replace it with a set of three
new criteria:

•  Number of problems solved by PRODIGY being
guided by the individual with a maximum
node limit. To maximize. This node limit is four
times the number of nodes that would be needed
to solve the problem if PRODIGY could go straight
to the solution.

•  Number  of  problems  solved  by  the  individ¬ 
ual  more  efficiently  than  PRODIGY  alone.  To 
maximize.  Efficiency  in  this  case  means  fewer 
nodes  expanded. 

•  Total  number  of  nodes  expanded  by  the  in¬ 
dividual.  To  minimize. 
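
The three criteria above can be combined into a single comparable value, as in the following Python sketch. The record layout (`solved`, `nodes`, `prodigy_nodes`) is hypothetical, and counting a problem that PRODIGY alone failed to solve as "solved more efficiently" is an assumption of this sketch, not something stated in the text.

```python
def node_limit(direct_solution_nodes):
    """Node limit: four times the nodes of a straight path to the solution."""
    return 4 * direct_solution_nodes

def performance(results):
    """results: one dict per fitness case with hypothetical keys
    'solved' (bool, within the node limit), 'nodes' (expanded under the
    individual) and 'prodigy_nodes' (expanded by PRODIGY alone, or None
    if PRODIGY alone failed; treated here as less efficient)."""
    solved = sum(r["solved"] for r in results)
    more_efficient = sum(
        r["solved"] and (r["prodigy_nodes"] is None or r["nodes"] < r["prodigy_nodes"])
        for r in results
    )
    total_nodes = sum(r["nodes"] for r in results)
    # Negate the node count so that "larger tuple" always means "better".
    return (solved, more_efficient, -total_nodes)

x = performance([
    {"solved": True, "nodes": 30, "prodigy_nodes": 50},
    {"solved": False, "nodes": 80, "prodigy_nodes": 60},
])
y = performance([
    {"solved": True, "nodes": 40, "prodigy_nodes": 50},
    {"solved": True, "nodes": 70, "prodigy_nodes": 60},
])
print(x, y, y > x)
print(node_limit(12))
```

Python compares tuples lexicographically, so the second individual wins on the first criterion even though the total node counts are equal.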

In order to test an individual with these new criteria, it
has to be loaded into PRODIGY. Then, PRODIGY is
run on each of the planning problems used for learning
(the fitness cases, in GP terminology). However, complex
problems need to be given a high node limit if they
are to be solved. As many such evaluations must be
performed for each generation, only simple problems
can be used for learning (otherwise the fitness function
would take too long). This is another bias to take
into account.^ However, [Borrajo and Veloso, 1997]
shows empirically that training with simple problems
is enough for learning control knowledge useful for solving
more complex problems.

4  EXPERIMENTAL  RESULTS 

In  order  to  test  our  multi-strategy  approach,  the  fol¬ 
lowing  steps  were  carried  out: 

1.  HAMLET was trained with 400 learning planning
problems.  Two  domains  were  used:  blocksworld 
and  logistics.  A  set  of  control  rules  and  a  set  of 
positive  examples  were  obtained  for  each  domain. 
They  were  used  as  background  knowledge  in  the 
next  step. 

2.  EvoCK  was  trained  in  the  blocksworld  and  logis¬ 
tics  with  192  and  188  learning  planning  problems 

^[Aler et al., 1998b, Aler et al., 1998a] was not constrained
by this bias.
respectively. A population size of 2 was used. Certainly,
a population size of 2 is not common in GP.
Previous work [Aler et al., 1998a] shows that using
a bigger population seems to be good, but the results
are not conclusive: the interaction between population
size and seeding the initial population is not
properly understood yet. In our case, the seeding
individual (coming from HAMLET) is much better
than the other initial individuals (randomly generated);
therefore, two things might happen. First,
during the earlier generations the seeding individual
would not be selected very often, so some time
would be spent evaluating individuals that contain
no knowledge. Second, if the seeded individual
is much better than the randomly generated
individuals, in the long term all members might
contain similar genetic information to the seeded
individual [Fraser and Rush, 1994]. In this paper
a population of 2 has been chosen because in that
way we make sure that genetic operators will always
act on individuals which contain knowledge,
and therefore the impact of knowledge will be
better controlled. In any case, we plan to carry
out several experiments that will study the interaction
between population size and population seeding in
detail. Performing crossover in such a small population
is not meaningful, so standard crossover is
not used in this paper. However, EvoCK can
use it in general. Background knowledge from
the previous step was used in the two ways described
in subsection 3.3. As GP is a stochastic
method, several experiments were carried out for
each domain: 47 for the blocksworld and 54 for
logistics. Each experiment ran for 100,000 evaluations.
From each experiment, a set of control
rules was obtained.

3.  HAMLET was trained in a manner similar to
EvoCK. HAMLET started with the sets of control
rules obtained in step 1 and refined them with
the rest of the learning problems used to train
EvoCK. Two sets of control rules were obtained
(one for each domain).

4.  Finally, the sets of control rules obtained by
EvoCK and HAMLET were tested with a new
set of problems (416 for the blocksworld and 347
for logistics) under the same conditions. Results are
shown in Table 1. As EvoCK obtained one set
of rules from each experiment, two quantities are
shown: the number of problems solved by the best
of all sets of control rules (along with the number
of control rules for that individual) and the average
number of problems solved over all sets.


Table 1: Results for PRODIGY, HAMLET and EvoCK
in both the blocksworld and logistics domains.

                         % Prob.   Number     Average
                         Solved    of Rules   % P. Solv.

Blocksworld
  PRODIGY ALONE          21%
  HAMLET SEED            58%       12
  HAMLET                 18%       13
  EvoCK (best indiv.)    87%       4          [illegible]

Logistics
  PRODIGY ALONE          43%
  HAMLET SEED            52%       56
  HAMLET                 46%       64
  EvoCK (best indiv.)    95%       19         65%

Table 1 shows that when HAMLET tries to refine and
improve a set of control rules previously learned (HAMLET
SEED in Table 1), the percentage of test problems
actually solved drops: in the blocksworld it goes from
58% to 18%, and in logistics from 52% to 46%. On
the other hand, EvoCK improves the set of control
rules given as seed for the initial population: from 58% to
87% in the blocksworld and from 52% to 95% in logistics.
The next section comments on these results. It is also
noticeable that EvoCK produces individuals with fewer
control rules than the seeding individual (from 12 to 4 control
rules in the blocksworld and from 56 to 19 in logistics),
hence returning more efficient individuals. In order
to show that the control rules learned are general and
useful for more complex problems, a breakdown of the
results is displayed in Tables 2 and 3.


Table 2: Breakdown of the number of testing problems solved
in the blocksworld by HAMLET and EvoCK according to the
number of goals and of objects.

# Goals   # Objects   PRODIGY   HAMLET seed   HAMLET        EvoCK
50        60          6%        0%            [illegible]   56%
20        50          6%        31%           4%            81%
20        20          6%        27%           4%            69%
10        50          21%       67%           19%           96%
10        20          15%       56%           15%           83%
10        15          31%       48%           15%           85%
5         50          15%       70%           2%            92%
5         20          15%       82%           18%           95%
5         15          40%       82%           35%           98%
5         10          50%       85%           60%           95%

Table 3: Breakdown of the number of testing problems solved
in logistics by HAMLET and EvoCK according to the number of
goals and of objects.

# Goals   # Objects   PRODIGY   HAMLET seed   HAMLET   EvoCK
50        50          0%        0%            0%       75%
20        50          3%        0%            0%       100%
20        20          7%        28%           0%       83%
10        50          13%       0%            0%       100%
10        20          20%       53%           33%      100%
10        15          20%       67%           47%      100%
10        10          7%        67%           40%      100%
5         50          42%       0%            0%       100%
5         20          58%       83%           67%      100%
5         15          42%       42%           67%      100%
5         10          25%       58%           67%      100%
5         5           33%       83%           92%      100%
2         50          90%       60%           20%      100%
2         20          90%       100%          100%     100%
2         15          90%       90%           100%     100%
2         10          90%       80%           90%      100%
2         5           100%      100%          100%     100%
2         2           100%      100%          100%     100%
1         50          100%      100%          80%      100%
1         20          90%       100%          100%     100%
1         15          90%       100%          100%     100%
1         10          90%       100%          100%     100%
1         5           100%      100%          100%     100%
1         2           100%      100%          100%     100%

Tables 2 and 3 show a breakdown of the problems solved by
the different methods in the blocksworld and logistics
domains, respectively, according to problem complexity.
This complexity is measured by the number of goals and
objects in the planning problem. It is easy to see that
EvoCK improves drastically with respect to the initial seed
(HAMLET seed) by solving very hard problems. The percentages
of testing problems solved by PRODIGY working alone, the
initial HAMLET seed and the final HAMLET result are also
shown.

5  DISCUSSION  AND 
CONCLUSIONS 

After experimenting with both systems (EvoCK and HAMLET),
we can draw the following conclusions and comparisons.

• HAMLET does not have a trade-off between correct
knowledge and the utility of that knowledge. HAMLET manages
to learn quite correct knowledge [Borrajo and Veloso,
1997], but sometimes having a lot of correct control rules
is not an advantage, because it takes a long time to use
them (this is called the utility problem [Minton, 1988]).
This explains in part HAMLET's poor behavior. On the other
hand, our results in [Aler et al., 1998a] show that it is
more difficult for GP alone to obtain correct knowledge.
However, it is very easy to take the utility problem into
account in the fitness function (several of its components
push toward that end). Thus, we see that our multi-strategy
approach works better than either method alone by combining
the biases of both methods.

• Another problem with HAMLET is that, as it is a lazy
incremental system, in order to refine an incorrect control
rule it assumes that it will eventually find an appropriate
set of negative examples. Given that the potential problem
space is infinite (huge from a computational point of
view), the likelihood of finding that appropriate set might
be very small. In any case, previous work has shown that in
the long run HAMLET tends to converge to the correct
knowledge [Borrajo and Veloso, 1997]. Since GP is a
non-incremental system, it is able to detect negative
examples at once by evaluating the whole set of training
problems. On the other hand, non-incremental methods are
less efficient when learning in complex domains. Again, the
complementary aspects of both systems allow the deficiencies
of each system to be overcome.

• Another difference between using GP in this way and more
traditional learning techniques is that, even when using
background knowledge, its generalization and specialization
operators do not have knowledge about how planning acts. In
contrast, learning techniques such as PRODIGY/EBL [Minton,
1988] or HAMLET "know"^4 how to generalize or specialize in
planning domains. GP has no such knowledge, so many of the
genetic modifications will not work. Besides, genetic
operators are not so constrained by powerful heuristics, so
they might obtain different and new results compared with
more traditional methods. Another way to see this is that
HAMLET (and many other learning methods) takes advantage of
the specific-to-general ordering of the control rule space:
HAMLET's trajectory through the control rule space consists
of generalization or specialization steps, in reaction to
new examples [Shapiro, 1983]. GP does not take much
advantage of this specific-to-general ordering: a mixture of
generalizations and specializations is performed at each
step in the search. Besides, generalization operators that
take advantage of the ordering heuristic are easily added to
the operator pool, as our system shows.

    ^4 Or at least, they have powerful heuristics.

• Given that genetic operators do not handle much
knowledge, they are faster than classical learning search
operators.

• HAMLET is deterministic: from the same set of training
cases, it will always obtain the same set of control rules.
On the other hand, GP is stochastic: it can be run several
times and obtain different knowledge every time.

•  There  is  a  trade-off  between  understandability 
and  efficiency.  Hamlet  tends  to  produce  control 
knowledge  which  is  easier  to  understand  whereas 
EvoCK  control  knowledge  is  more  difficult  to  un¬ 
derstand  (but  more  efficient). 

• Finally, an important advantage of GP over other learning
techniques applied to problem solving is its flexibility.
Very different learning biases can be tested without
changing the method itself. Following Utgoff's
classification [Utgoff, 1986], GP biases are:

- The language bias can be changed easily. That is not the
case with many other learning techniques applied to problem
solving, because their search operators depend heavily on
the representation language used. For instance, HAMLET only
uses a subset of the control rule language allowed by
PRODIGY, while GP could easily use the whole language.

- The exploration bias. GP uses just two task-independent
operators (crossover and mutation). However, as this paper
shows, many variations of these operators can be added, for
instance task-dependent operators (like generalization and
specialization).

- The evaluation bias. In GP, different evaluation biases
can easily be combined in the same evaluation function.
Also, it is very easy to change from one fitness function
to another. In fact, in this paper we have presented a new
fitness function that improves on previous results obtained
using our scheme [Aler et al., 1998a, Aler et al., 1998b].

6  RELATED  WORK 

There have been different approaches to acquiring control
knowledge for non-trivial (non-linear) problem solving.
Some of them use analogy [Kambhampati, 1989, Veloso and
Carbonell, 1993], others pure deduction [Katukam and
Kambhampati, 1994, Minton and Zweben, 1993] or pure
induction [Leckie and Zukerman, 1991], and some combine
deduction and induction [Borrajo and Veloso, 1997, Estlin
and Mooney, 1996]. The main difference from our approach is
that they did not combine incremental knowledge-intensive
methods with non-incremental methods (GP).

Some  innovative  approaches  to  problem  solving  use  ge¬ 
netic  programming  [Koza,  1992].  This  approach  was 
started  by  Koza  [Koza,  1989,  Koza,  1992],  who  evolved 
a  planner  that  solved  a  very  specific  set  of  problems 
in  the  blocks  world  domain.  Handley  [Handley,  1994] 
used  GP  to  evolve  plans  for  specific  problems  in  the 
blocksworld  domain.  Muslea  [Muslea,  1997]  general¬ 
ized,  extended,  and  formalized  this  idea,  and  showed 
how  any  planning  problem  could  be  translated  to  an 
equivalent  GP  problem.  He  tested  it  successfully  in 
several domains. Spector [Spector, 1994] proposed and
analyzed several ways in which GP could be used for
planning. The main difference from our approach is that
they used GP to search in the space of plans.

Abstract

In  this  paper  an  extensive  experimental  eval¬ 
uation  of  an  evolutionary  approach  to  con¬ 
cept  learning  is  presented.  The  experimen¬ 
tation,  performed  with  the  system  G-NET, 
investigates  the  effectiveness  of  the  approach 
along  the  following  dimensions:  Robustness 
with  respect  to  parameter  setting,  effective¬ 
ness  of  the  MDL  criterion  coupled  with  a 
stochastic  search  bias,  impact  of  coevolution 
on  the  quality  of  the  solution  and  on  the  com¬ 
putational  effort  required,  and  ability  to  face 
problems  requiring  structured  representation 
languages.  A  discussion  of  the  obtained  re¬ 
sults  and  a  suggestion  on  when  this  type  of 
approach  might  be  useful  is  also  provided. 

1  INTRODUCTION 

Supervised concept learning has been tackled, so far, with
several approaches, including symbolic, connectionist and
evolutionary ones. Different approaches are better suited
to different classes of problems, depending, for instance,
on the nature of the data or the availability of
domain-specific knowledge.

In the hope of taking a small step toward matching learning
algorithms to problems, in this paper we present an
experimental exploration of an evolutionary approach to the
task of learning concept descriptions. Our exploration is
articulated along three dimensions: the capability of
dealing with complex representation languages, such as
subsets of predicate logic; the exploitation of distributed
architectures, allowing coevolution to be efficiently
implemented; and the interaction between the stochastic
search bias and the Minimum Description Length (MDL)
principle (Rissanen, 1978), used as the evaluation
criterion of the concept description.

The experimentation has been conducted with a new version
of G-NET (Version 2.0), a descendant of the system REGAL
(Giordana and Neri, 1996). G-NET's architecture relies on a
computational model characterized by the absence of global
memory, which extends the diffusion model (Manderik and
Spiessens, 1989) previously developed for genetic
algorithms. With respect to a previous implementation
(Anglano et al., 1997), the version described here includes
an explicit coevolutionary strategy based on (Potter et
al., 1995), a new objective function based on the MDL
principle, and an improved set of genetic operators.

A first point that emerged from the experimentation, using
both G-NET and REGAL, is that evolutionary search
techniques can indeed be fruitfully exploited in concept
acquisition. On standard benchmarks they showed performance
at least comparable with the best results presented in the
literature (Neri and Saitta, 1996).

A second point is that good performance does not come for
free: using a simple genetic algorithm, easy to understand
and quick to implement, may not be a solution. The
evolutionary inference engine has to be integrated into a
possibly complex architecture that accommodates
sophisticated description languages, flexible heuristic
learning strategies, and distributed computation.

A third point is that evolutionary search proved to be
quite robust, because it did not require any parameter
tuning over a range of different problems. Finally, the
stochastic search bias proved to be well suited to
different evaluation criteria (Anglano et al., 1997),
including the MDL (Rissanen, 1978). G-NET is based, as
REGAL was, on the theory of niches and species formation,
which has already proved effective in learning disjunctive
concept definitions (Giordana and Neri, 1996). As niches
and species formation is a way of addressing multimodal
search problems, disjunctive concept induction naturally
fits in this framework. However, methods based on species
formation may require large populations when weak species
must survive in the presence of much stronger ones
(Giordana and Neri, 1996).

In order to cope with this problem, G-NET 2.0 uses a new
learning method, which combines the Universal Suffrage
selection scheme (Giordana and Neri, 1996) with an explicit
coevolutionary strategy, similar to the one proposed in
(Potter et al., 1995). Finally, G-NET 2.0 uses a new set of
genetic operators, which explicitly aims at preserving the
diversity in the population, thus reducing premature
convergence and increasing the effectiveness of the genetic
search.

2  EVOLUTIONARY  ALGORITHM 

Like its ancestor REGAL, G-NET learns concept descriptions
in a language similar to $VL_{21}$ (Michalski, 1983). More
specifically, a concept is described by a set
$\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_n\}$ of Horn
clauses, in which the construct of internal disjunction is
also allowed. In Logic Programming, an internal disjunction
is a special term describing a set of constants. By setting
a limit on the maximum complexity, Horn clauses can be
encoded as fixed-length bitstrings. A detailed description
of the language used by G-NET can be found in (Giordana and
Neri, 1996; Giordana et al., 1997).

G-NET's inductive engine exploits a stochastic algorithm
organized in two levels. The lower level, named the Genetic
layer (G-layer), searches for Horn clauses representing
partial definitions $\varphi$ of the concept to learn. The
architecture of the G-layer derives from the diffusion
model (Manderik and Spiessens, 1989), and integrates
different ideas originating in the fields of evolutionary
computation and tabu search (Rayward-Smith et al., 1989).
The upper level, namely the Supervisor, builds up a global
disjunctive definition $\Phi$ out of the partial
definitions $\varphi_i$ generated inside the G-layer, using
a greedy set covering algorithm. Moreover, the Supervisor
interacts with the G-layer according to a coevolutionary
strategy (Potter et al., 1995), which aims at increasing
the probability of evolving clauses useful for improving the
quality of the disjunctive concept description currently in
progress.
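The greedy set covering step mentioned above can be pictured with the following minimal sketch. It is not G-NET's actual Supervisor code: the representation of clauses as names mapped to the sets of positive instances they cover, the function name `greedy_cover`, and the toy data are all assumptions made for illustration, and the selection criterion used here (largest number of newly covered instances) is the textbook one, which may differ from the criterion G-NET applies.

```python
def greedy_cover(coverage, positives):
    """Textbook greedy set cover (illustrative, not G-NET's code).

    coverage:  dict mapping a clause name to the set of positive
               instances it covers (hypothetical representation).
    positives: iterable of all positive instances to be covered.
    Returns (chosen clause names, instances left uncovered).
    """
    uncovered = set(positives)
    chosen = []
    while uncovered and coverage:
        # Pick the clause covering the most still-uncovered instances.
        best = max(coverage, key=lambda c: len(coverage[c] & uncovered))
        gained = coverage[best] & uncovered
        if not gained:
            break  # no clause covers any of the remaining instances
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Toy data, invented for illustration only.
coverage = {
    "phi1": {1, 2, 3},
    "phi2": {3, 4},
    "phi3": {4},
    "phi4": {5},
}
chosen, left = greedy_cover(coverage, positives={1, 2, 3, 4, 5, 6})
print(chosen, left)
```

In this toy run instance 6 is covered by no clause, so it is reported as uncovered; in the setting described above such an instance would presumably remain a target for further search in the G-layer.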

From a computational point of view, the G-layer consists of
a set of elementary searching nodes called G-nodes. Every
G-node $G_i$ is associated with a single concept instance
$e^+$ and executes a local evolutionary search aimed at
constructing an inductive hypothesis covering $e^+$ and
having a fitness value as high as possible. The same
instance $e^+$ can be assigned to more than one G-node.

The  association  between  G-nodes  and  concept  in¬ 
stances  is  dynamically  established  by  the  Supervisor, 
which  decides  what  regions  of  the  hypothesis  space  to 
search,  and  how  much. 

Every  G-node  is  provided  with  a  small  local  memory, 
where  it  stores  the  set  of  current  hypotheses  it  is  work¬ 
ing  on  (local  population).  Basically,  the  search  algo¬ 
rithm  executed  by  a  G-node  resembles  a  simple  Ge¬ 
netic  Algorithm: 

G-node Search Algorithm
repeat

1. Select two clauses $\varphi_1$ and $\varphi_2$ from the
local memory with probabilities proportional to their
fitness;

2. Create two new clauses $\varphi'_1$ and $\varphi'_2$,
both different from $\varphi_1$ and $\varphi_2$;

3. Evaluate $\varphi'_1$ and $\varphi'_2$ on the learning
set;

4. Broadcast the new clauses to every G-node associated
with some instance $e^+$ they correctly cover;

until a halt condition is reached

The outcome of the evaluation step is a fitness value
$f_L(\varphi)$, corresponding to the quality of the clause
$\varphi$ (see below). By generalizing a formula covering
the associated instance $e^+$, a G-node can implicitly
generate formulas that also cover other instances assigned
to different G-nodes. The aim of the broadcasting step is
to propagate these formulas to the G-nodes that can
potentially benefit from them. When a clause is broadcast
to another G-node, it competes to enter the local memory by
playing a kind of stochastic tournament (Harik, 1995),
based on the fitness value $f_L$. As the policy we adopt
enforces diversity in the local memories, a clause is
allowed to play the tournament only if no copy of it is
already there. At the beginning, the population of a G-node
is initialized with only one individual and can grow up to
a predetermined maximum size. The tournament step is
performed only after the maximum size has been reached.
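The admission policy for broadcast clauses can be sketched as follows. The duplicate check and the "grow first, tournament later" rule come from the text; the tournament itself is an assumption, since the text only cites (Harik, 1995) without giving details. Here the newcomer is assumed to challenge one randomly chosen resident and to win with probability proportional to its share of the two fitness values. The function name `offer_clause` and the dictionary-based local memory are likewise hypothetical, and fitness values are assumed to be non-negative.

```python
import random


def offer_clause(memory, clause, fitness, max_size, rng):
    """Try to insert a broadcast clause into a G-node's local memory.

    memory: dict mapping clause -> fitness (hypothetical representation).
    Returns True if the clause entered the memory.
    """
    if clause in memory:
        return False  # diversity: a copy is already present
    if len(memory) < max_size:
        memory[clause] = fitness  # still growing: no tournament yet
        return True
    # ASSUMED tournament rule: challenge one random resident and win
    # with probability f_new / (f_new + f_old).
    rival = rng.choice(list(memory))
    total = fitness + memory[rival]
    p_win = 0.5 if total == 0 else fitness / total
    if rng.random() < p_win:
        del memory[rival]
        memory[clause] = fitness
        return True
    return False


rng = random.Random(0)
memory = {}
first = offer_clause(memory, "a", 0.0, max_size=2, rng=rng)      # grows
duplicate = offer_clause(memory, "a", 0.0, max_size=2, rng=rng)  # rejected
second = offer_clause(memory, "b", 0.0, max_size=2, rng=rng)     # grows
strong = offer_clause(memory, "c", 5.0, max_size=2, rng=rng)     # tournament
```

With this assumed rule a newcomer of fitness 5.0 facing residents of fitness 0.0 always wins, so the last call replaces one resident while the memory size stays at its maximum.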

This way of propagating inductive hypotheses among G-nodes
promotes the formation of families of hypotheses, which
cluster the concept instances into groups roughly
corresponding to different modalities of the target
concept. From the point of view of evolutionary computation
this can be seen as a process of niches and species
formation (Goldberg and Richardson, 1987).

The  emergence  of  species  (i.e.  concept  modalities)  is 
the  baseline  for  the  coevolutionary  strategy  adopted 
by  the  Supervisor.  Periodically,  the  Supervisor  (a) 
collects  the  best  representatives  of  each  species,  and 
works  out  a  global  concept  description,  (b)  reassigns 
the  concept  instances  to  the  G-nodes  in  order  to  in¬ 
crease  the  search  efforts  where  emerging  species  still 
correspond  to  low  quality  inductive  hypotheses,  and 
(c)  supplies  a  corrective  term  to  be  added  to  the  fitness 
of  the  inductive  hypotheses  in  the  G-layer,  helping  the 
species  that  better  contribute  to  the  global  solution  to 
develop  further. 

3  THE  FITNESS  EVALUATION 

In G-NET, two different fitness functions, $f_G$ and $f_L$,
are used to evaluate global (disjunctive) and local
(conjunctive) concept descriptions, respectively. Both
measures are based on the Minimum Description Length
principle (Rissanen, 1978). The function $f_G(\Phi)$ is the
sum of three terms:

$$f_G(\Phi) = MDL_{max} - MDL(\varepsilon^+(\Phi) + \varepsilon^-(\Phi)) - MDL(\Phi) \qquad (1)$$

where $MDL_{max}$ is the MDL of the whole learning set,
$MDL(\varepsilon^+(\Phi) + \varepsilon^-(\Phi))$ is the MDL
of the set $\varepsilon^+$ of positive concept instances not
covered by $\Phi$ together with the set $\varepsilon^-$ of
negative instances covered by $\Phi$, and $MDL(\Phi)$ is the
minimum description length of the syntactic form of $\Phi$.
In turn, $MDL(\Phi)$ is computed as the sum
$MDL(\Phi) = \sum_i MDL(\varphi_i)$ of the MDL of the single
clauses belonging to $\Phi$. In all cases, the expressions
for the MDL of the different terms have been obtained using
Stirling's approximation, as in (Oliveira and
Sangiovanni-Vincentelli, 1996). The definition of $f_G$ has
been chosen so as to have a function that increases when
the MDL decreases, because such a function is easier to
transform into a probability, which is used to guide the
stochastic search.

The local fitness $f_L$ for evaluating a single clause
$\varphi$ in a G-node takes the form:

$$f_L(\varphi) = MDL_{max} - MDL(\varphi) - MDL(\varepsilon^-(\varphi)) + (f_G(\Phi') - f_G(\Phi)) \qquad (2)$$

where $\Phi$ is the current global description constructed
by the Supervisor, and $\Phi'$ is the formula obtained by
adding $\varphi$ to $\Phi$ and eliminating all redundant
disjuncts but $\varphi$. In other words, the second and
third terms evaluate how simple and consistent $\varphi$ is.
The fourth term is the bias for enforcing the
coevolutionary strategy and evaluates how well $\varphi$
combines with the other existing partial descriptions to
form a global solution, covering the instance $e^+$
associated with the G-node and as many of the other
instances as possible.
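To make the structure of the two fitness functions concrete, the sketch below computes them from MDL values that are simply passed in as numbers. How those MDL values are obtained (the Stirling-based expressions) is not reproduced here; the function names, argument names and the numeric values are hypothetical and chosen only to show how the terms combine.

```python
def f_global(mdl_max, mdl_errors, clause_mdls):
    """Global fitness: MDL_max - MDL(errors) - MDL(Phi).

    mdl_errors:  MDL of uncovered positives plus covered negatives.
    clause_mdls: MDL of each clause in Phi; MDL(Phi) is their sum.
    """
    return mdl_max - mdl_errors - sum(clause_mdls)


def f_local(mdl_max, mdl_clause, mdl_neg_covered, f_g_with, f_g_without):
    """Local fitness of one clause.

    f_g_with:    global fitness of Phi' (Phi plus the clause, pruned).
    f_g_without: global fitness of the current Phi.
    """
    coevolution_bias = f_g_with - f_g_without
    return mdl_max - mdl_clause - mdl_neg_covered + coevolution_bias


# Invented numbers (in bits), for illustration only.
current = f_global(1000, 120, [50, 30])    # Phi with two clauses
extended = f_global(1000, 60, [50, 30, 40])  # Phi' after adding a clause
local = f_local(1000, 40, 10, extended, current)
print(current, extended, local)
```

In this example the added clause reduces the error term by more than its own description length, so the bias term is positive and raises the clause's local fitness above what simplicity and consistency alone would give.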

4  THE  COEVOLUTION  STRATEGY 

The Supervisor enforces coevolution by means of two
algorithms, which are executed periodically at the end of a
macro-cycle. A macro-cycle is measured by counting the
number of iterations of the Search Algorithm
($\mu$-cycles) globally performed, in the G-layer, by the
G-nodes. The first algorithm computes a global concept
description $\Phi$ out of the best representatives of the
species that have emerged in the G-layer, and is based on a
hill-climbing optimization strategy. At first, the locally
best hypothesis is collected from every G-node, and these
hypotheses are merged into a redundant disjunctive
description $\Phi'$. Then, $\Phi'$ is optimized by
eliminating the disjuncts that are not necessary. This is
done by repeating the following cycle until $\Phi'$ reaches
a final form which cannot be optimized further:

1. Search for the clause $\varphi$ such that
$f_G(\Phi' - \varphi)$ shows the greatest improvement.

2. Set $\Phi' = \Phi' - \varphi$.
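The two-step cycle above is an ordinary steepest-ascent elimination loop, which can be sketched as follows. The global fitness is supplied as a callable, and the toy fitness used in the example (ten points per covered positive instance minus three per clause) is an invented stand-in for $f_G$, not the MDL-based function of the paper; all names are hypothetical.

```python
def prune_disjuncts(clauses, f_g):
    """Remove clauses one at a time while the global fitness improves.

    clauses: list of clauses forming the redundant description.
    f_g:     callable mapping a list of clauses to a fitness value.
    """
    current = list(clauses)
    best_score = f_g(current)
    while current:
        # Step 1: find the removal giving the greatest fitness.
        candidates = [
            (f_g(current[:i] + current[i + 1:]), i)
            for i in range(len(current))
        ]
        score, idx = max(candidates)
        if score <= best_score:
            break  # no removal improves the description any further
        # Step 2: drop that clause and continue.
        del current[idx]
        best_score = score
    return current


# Toy stand-in for f_G (invented): reward coverage, penalize size.
covers = {"a": {1, 2, 3}, "b": {2, 3}, "c": {4}, "d": {1}}


def toy_f_g(description):
    covered = set().union(*(covers[c] for c in description))
    return 10 * len(covered) - 3 * len(description)


pruned = prune_disjuncts(["a", "b", "c", "d"], toy_f_g)
print(pruned)
```

Here clauses "b" and "d" are redundant given "a", so they are dropped, while removing "a" or "c" would lose coverage and lower the score.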

The  second  algorithm  computes  the  assignment  of  the 
(positive)  concept  instances  to  the  G-nodes.  The  ba¬ 
sic  strategy  consists  in  focusing  the  search  on  the  con¬ 
cept  instances  which  are  covered  by  poor  inductive  hy¬ 
potheses,  without  omitting  to  continue  the  refinement 
of  the  other  hypotheses.  This  is  done  by  balancing  the 
computation  among  the  different  emerging  species,  in 
such  a  way  that  species  covering  smaller  niches  will 
get  the  same  computational  power  as  the  ones  cover¬ 
ing  larger  niches. 

The Supervisor keeps track of the solution state of every
positive instance $e_i^+ \in E^+$ (the set of all positive
instances), i.e., the best solution found for it. Moreover,
it also records the number $c_i$ of $\mu$-cycles related to
$e_i^+$ that occurred during the past computation. The
kernel of the coevolutionary control strategy is the method
used for accounting for the $\mu$-cycles related to every
$e_i^+$. As soon as clauses covering many examples begin to
develop, spontaneously born clusters of G-nodes appear that
elect the same clause as the current best hypothesis in
their population. This can be interpreted as a form of
implicit cooperation, which leads to the generation of a
clause representative of the work of all of them.
Therefore, the Supervisor attributes to an instance $e_i^+$
all the $\mu$-cycles executed by the G-nodes whose local
memory contains a copy of the best clause attributed to
$e_i^+$.

At the end of a macro-cycle, the concept instances are
reassigned to G-nodes with the goal of balancing the work
spent on every $e_i^+$, on the basis of the number $c_i$ of
$\mu$-cycles. Let $C$ be the maximum value of $c_i$
($1 \le i \le |E^+|$). The Supervisor computes for every
$e_i^+$ the amount $\delta_i = C - c_i$ of $\mu$-cycles
necessary to balance the computational cost for it.
Afterwards, the instances are stochastically assigned to
G-nodes with probability proportional to $\delta_i$. When a
G-node $G$ is assigned to a new instance $e^+$, it is
restarted. If the global description $\Phi$ contains a
clause $\varphi_e$ covering $e^+$, $\varphi_e$ is inserted
in the population of $G$. Otherwise, the population is
initialized by means of the seeding operator described
below.
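The balancing rule can be illustrated with a short sketch that draws one instance per G-node with probability proportional to the deficit $C - c_i$. The function name `reassign`, the dictionary of cycle counts, and the fallback to a uniform draw when every deficit is zero are assumptions; the text does not say what happens in the perfectly balanced case.

```python
import random


def reassign(cycles, n_nodes, rng):
    """Draw one positive instance for each G-node.

    cycles:  dict mapping instance -> number of cycles already spent
             on it (the c_i of the text).
    n_nodes: number of G-nodes to (re)assign.
    """
    instances = list(cycles)
    c_max = max(cycles.values())
    deficits = [c_max - cycles[e] for e in instances]
    if sum(deficits) == 0:
        # ASSUMPTION: fall back to a uniform draw when already balanced.
        deficits = [1] * len(instances)
    return rng.choices(instances, weights=deficits, k=n_nodes)


rng = random.Random(0)
# Invented counts: only "e2" lags behind, so it attracts every G-node.
assignment = reassign({"e1": 300, "e2": 100, "e3": 300}, n_nodes=5, rng=rng)
print(assignment)
```

Note that under a strictly proportional rule the most-worked instance always has deficit zero and is never drawn in that macro-cycle, which is consistent with the stated goal of shifting effort toward instances that have received less computation.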

5  THE  GENETIC  OPERATORS 

In the same way as REGAL, G-NET represents Horn clauses as
fixed-length bitstrings (Giordana et al., 1997); search
operators can then be implemented as in standard Genetic
Algorithms (Goldberg, 1989). As a matter of fact, G-NET
uses three basic operators: seeding, crossover and
mutation. The seeding operator (Giordana and Neri, 1996) is
used for initializing the local memory in the G-nodes when
it is empty. When called in a G-node $G_i$, it
stochastically generates a clause $\varphi_i$, which is
guaranteed to cover the instance $e^+$ currently associated
with $G_i$.

Crossover and mutation operators can be applied in
different modalities, depending upon the clauses they are
applied to, and are guaranteed to produce new hypotheses
different from the parents (original clauses).

The crossover is a combination of the two-point crossover
with a variant of the uniform crossover (Syswerda, 1989),
modified in order to perform either generalization or
specialization of the hypotheses. More specifically, the
crossover operator can be activated in three different
modalities: exchanging, specializing and generalizing,
which are stochastically selected depending on the
consistency and completeness of the selected clauses. Given
a pair of clauses ($\varphi_1$, $\varphi_2$), the modality
to use is stochastically decided in two steps. In the first
step it is decided whether to apply the exchanging
modality, with probability $p_{ec}$ (by default
$p_{ec} = 0.1$), or to proceed to the second step, with
probability $1 - p_{ec}$. If the second step is entered,
the system decides whether to apply generalization or
specialization to each one of the parent clauses. Let
$\varphi_i$ be one of the parents; the probability
$p_{ge}(\varphi_i)$ of using generalization and the
probability $p_{se}(\varphi_i) = 1 - p_{ge}(\varphi_i)$ of
using specialization are computed according to the rule:

[Equation (3), which defines $p_{ge}(\varphi_i)$ in terms of
$m^+(\varphi_i)$ and $\varepsilon^-(\varphi_i)$, is not
legible in the source.]

where $m^+$ is the number of positive instances correctly
classified by $\varphi_i$ and $\varepsilon^-$ the number of
negative instances, as previously defined. Afterwards, if
the same modality has been chosen for both operands, the
crossover is applied with this modality. Otherwise, if the
modalities are discordant, the exchanging modality is used.
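The two-step decision can be sketched as below. Because the expression for $p_{ge}$ is not reproduced here, the sketch takes the generalization probabilities of the two parents as plain inputs; the function name `choose_modality` and its argument names are hypothetical.

```python
import random


def choose_modality(p_ge_1, p_ge_2, rng, p_ec=0.1):
    """Pick the crossover modality for a pair of parent clauses.

    p_ge_1, p_ge_2: generalization probabilities of the two parents
                    (supplied by the caller; their formula is not
                    modeled in this sketch).
    p_ec:           probability of pure exchange (default 0.1).
    """
    # Step 1: reserve a fixed share for pure information exchange.
    if rng.random() < p_ec:
        return "exchanging"
    # Step 2: each parent votes independently.
    vote_1 = "generalizing" if rng.random() < p_ge_1 else "specializing"
    vote_2 = "generalizing" if rng.random() < p_ge_2 else "specializing"
    # Concordant votes decide; discordant votes fall back to exchange.
    return vote_1 if vote_1 == vote_2 else "exchanging"


rng = random.Random(0)
both_general = choose_modality(1.0, 1.0, rng, p_ec=0.0)
both_special = choose_modality(0.0, 0.0, rng, p_ec=0.0)
discordant = choose_modality(1.0, 0.0, rng, p_ec=0.0)
forced = choose_modality(1.0, 1.0, rng, p_ec=1.0)
```

The extreme probabilities in the example make every outcome deterministic, which is convenient for checking the branching logic independently of the random generator.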

In  this  way,  the  generalizing  modality  tends  to  be  used 
when  the  parents  are  both  consistent,  the  specializing 
modality  when  the  parents  are  both  inconsistent,  and 
the  exchanging  modality  when  one  is  consistent  and 
the  other  is  inconsistent.  The  first  decision  step  guar¬ 
antees  that  an  assigned  percentage  of  pure  information 
exchange  takes  place  in  any  case. 

In order to guarantee the actual exchange of information,
the crossover algorithm first constructs an index
$I = \{i_1, i_2, \ldots, i_n\}$ of pointers to the positions
in the bitstring where the corresponding bits in the two
parents have different values. Afterwards, if
generalization/specialization has been chosen, two
temporary clauses $\psi_1$ and $\psi_2$, identical to
$\varphi_1$ and $\varphi_2$ respectively, are created.

Then, for every element $i_j \in I$ the following procedure
is repeated:

• if the generalizing modality has been chosen, then with
probability $p_u$ replace in $\psi_1$ and $\psi_2$ the
value of the bit $b(i_j)$ with the logical or of the
corresponding bits in the operands $\varphi_1$ and
$\varphi_2$;

• if the specializing modality has been chosen, then with
probability $p_u$ replace in $\psi_1$ and $\psi_2$ the
value of the bit $b(i_j)$ with the logical and of the
corresponding bits in $\varphi_1$ and $\varphi_2$.

If, after applying this stochastic procedure, no bit has
been changed, one bit chosen at random in $I$ is
generalized/specialized.

When the exchanging modality is chosen, the classical
two-point crossover is applied, with the difference that,
in order to guarantee an information exchange, the two
crossover points are chosen on the index vector $I$ instead
of on the whole bitstring.
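The generalizing and specializing modalities can be sketched on plain Python lists of 0/1 values as follows; the exchanging modality is left out. The function names, the list-based bitstring representation and the handling of identical parents (returned unchanged) are assumptions made for this illustration.

```python
import random


def differing_positions(a, b):
    """Index I: positions where the two parent bitstrings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def directional_crossover(a, b, mode, p_u, rng):
    """Generalizing (bitwise or) or specializing (bitwise and) crossover.

    a, b: parent bitstrings as equal-length lists of 0/1.
    mode: "generalizing" or "specializing".
    p_u:  per-position probability of applying the operation.
    """
    index = differing_positions(a, b)
    child1, child2 = list(a), list(b)
    if not index:
        return child1, child2  # ASSUMPTION: identical parents unchanged
    if mode == "generalizing":
        combine = lambda x, y: x | y
    else:
        combine = lambda x, y: x & y
    changed = False
    for i in index:
        if rng.random() < p_u:
            child1[i] = child2[i] = combine(a[i], b[i])
            changed = True
    if not changed:
        # Guarantee an effect: force the operation on one random position.
        i = rng.choice(index)
        child1[i] = child2[i] = combine(a[i], b[i])
    return child1, child2


rng = random.Random(0)
a, b = [1, 0, 1, 0], [1, 1, 0, 0]
gen1, gen2 = directional_crossover(a, b, "generalizing", 1.0, rng)
spe1, spe2 = directional_crossover(a, b, "specializing", 1.0, rng)
one1, one2 = directional_crossover(a, b, "generalizing", 0.0, rng)
```

Since the operation is applied only at positions where the parents differ, each application changes the bit of exactly one of the two children, which is why restricting attention to the index $I$ guarantees that at least one offspring differs from its parent.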

The mutation operator adopts a strategy similar to the
one described so far for crossover, and tries to generalize
or to specialize an individual, depending on its consistency
or inconsistency. The mutation operator also has
three modalities, namely seeding, generalizing
and specializing, which are selected with probability
$p_{seed}$ (by default $p_{seed} = 0.1$), $p_{gm}$ and $p_{sm}$, respectively.
The probabilities $p_{gm}$ for generalizing mutation
and $p_{sm}$ for specializing mutation are computed with
the rule:

$p_{sm} = (1 - p_{seed}) \, \frac{m^-}{m^- + m^+}$

$p_{gm} = 1 - p_{seed} - p_{sm}$    (4)

where the argument of $m^-$ and $m^+$ has been omitted
for brevity. If the specializing mutation is chosen, the
mutation is applied as follows: let $n_1$ be the number
of bits set to “1” in the bitstring; then, the mutation
operator turns to “0” a fraction $\gamma$ of them, which is
obtained by randomly selecting a real number in the
interval $[0, n_1/10]$. The bits to be set to “1” are selected
in an analogous way when the generalizing mutation
is chosen.
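
The choice of the mutation modality and the specializing mutation can be sketched as follows. This is only an illustration under stated assumptions: the specializing probability `p_sm` is passed in rather than computed, the value drawn from the interval [0, n1/10] is read as a number of bits and rounded up to an integer, and all function names are hypothetical.

```python
import math
import random


def choose_modality(p_seed, p_sm, rng=random):
    """Pick a mutation modality; p_gm follows from equation (4)."""
    p_gm = 1.0 - p_seed - p_sm
    if p_gm < 0.0:
        raise ValueError("p_seed + p_sm must not exceed 1")
    names = ["seeding", "generalizing", "specializing"]
    return rng.choices(names, weights=[p_seed, p_gm, p_sm])[0]


def specializing_mutation(bits, rng=random):
    """Turn some of the 1-bits of a clause (list of 0/1) into 0."""
    ones = [i for i, b in enumerate(bits) if b == 1]
    # Draw a real number in [0, n1/10]; rounding it up is an assumption.
    gamma = rng.uniform(0.0, len(ones) / 10.0)
    count = min(len(ones), math.ceil(gamma))
    child = list(bits)
    for i in rng.sample(ones, count):
        child[i] = 0
    return child
```

The generalizing mutation would be the mirror image, drawing from the 0-bits and setting the selected ones to 1.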

It is easy to recognize that generalizing and specializing
mutations are nothing other than the dropping condition
and adding condition operators defined in (Jong
et al., 1993).

In the cycle executed by each G-node, two clauses are
selected at each iteration with probability proportional
to their fitness $f$. If the population is empty, a new
individual is created using the seeding operator.
Otherwise, if the two selected clauses $\varphi_1$ and $\varphi_2$ are
different, crossover is applied. If, on the contrary, the
same clause is selected twice, two new clauses are
created using mutation.

The nice aspect of this strategy is that it automatically
adapts to the composition of the population. When
the population in a node is dominated by a clause whose
fitness is much higher than that of the others (and which
is therefore frequently selected for reproduction with itself),
the search turns into stochastic hill climbing.
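
One iteration of the cycle described above can be sketched in Python. This is a simplified illustration, not G-NET's implementation: the operators are passed in as callables, fitness values are assumed to be non-negative with at least one positive value, and every name is hypothetical.

```python
import random


def reproduction_step(population, fitness, crossover, mutate, seed, rng=random):
    """Return the new clauses produced by one iteration in a G-node."""
    if not population:
        # Empty population: create a new individual by seeding.
        return [seed()]
    weights = [fitness(clause) for clause in population]
    # Two draws with probability proportional to fitness (with replacement).
    first, second = rng.choices(population, weights=weights, k=2)
    if first != second:
        return list(crossover(first, second))
    # The same clause was drawn twice: create two mutants instead.
    return [mutate(first), mutate(first)]
```

When one clause holds almost all of the fitness mass, nearly every draw takes the mutation branch, which is the stochastic hill-climbing behaviour noted above.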

6  EXPERIMENTAL  EVALUATION 

In the following we present an extensive evaluation
of G-NET made on a variety of datasets, selected
with the aim of testing the system with respect to
the dimensions mentioned in Section 1: language bias,
robustness to evaluation criteria, and overall performance.
The parameters to tune are actually very few
(the constants of the genetic operators are not user tunable)
and correspond to the local population size $P_g$, the
macro-cycle size $M_c$ and the number of G-nodes $N_g$.
In all previous experimentation they did not
appear to be critical at all, and the following setting
has been chosen as a default: $P_g = 10$, $M_c = 300$, $N_g = 100$.
The results reported in the following have been
obtained using the default setting.

Table 1 reports a first group of results on datasets used
to test the system Smog (Oliveira and Sangiovanni-Vincentelli,
1996), which exploits the MDL as hypothesis
evaluation criterion. Results by C4.5 are
used as a baseline. The performance figures for Smog and
C4.5 are those reported in (Oliveira and Sangiovanni-Vincentelli,
1996); Smog was tested on many other datasets,
but only some of them are available at the U.C. Irvine
repository (Merz et al., 1991).

G-NET  has  always  been  run  with  a  set  of  100  G-nodes 
and  has  been  stopped  after  creating  40000  hypotheses. 
The  specific  goal  of  the  test  was  twofold:  to  investi¬ 
gate  how  G-NET  is  affected  by  changing  its  evalua¬ 
tion  criterion,  all  the  rest  being  the  same,  and  whether 
the  MDL  could  still  be  effective  when  coupled  with  a 
stochastic  search  bias,  such  as  the  one  provided  by  G- 
NET,  very  different  from  the  ones  used  in  Smog  and 
in  C4.5.  The  answer  has  been  positive  in  both  cases. 

Considering the results on the Monk-2 dataset, the
effectiveness of G-NET’s species formation mechanism
is evident: the system always found 26 disjuncts, sometimes
the correct ones and sometimes slightly different ones;
this explains the small error of the acquired knowledge
base. The stability of species formation is also confirmed
by the fact that in all cases G-NET found the same
number of disjuncts, which differed only by small variations.

Table 2 reports results of an experiment aimed at verifying
the utility of increasing the computational power
of the search when approaching a larger and more difficult
problem. The dataset used is the Splice Junctions
dataset (Towell and Shavlik, 1994). The task is that
of identifying boundaries between coding (exons) and
non-coding (introns) regions of genes occurring in eukaryote
DNA.

The Splice Junctions dataset has been previously used
to test the system REGAL, which presented the best
results so far among the many reported in the literature
(Neri and Saitta, 1996). While increasing the
system parallelism and decreasing the complexity,
G-NET achieved an even lower average error rate (the error rate
is an average over 3 runs). The second best results
were achieved by KBANN (Towell and Shavlik, 1994):
7.56% for EI, 8.47% for IE, and 4.62% for Neither.
This comparison suggests that genetic search could be
better suited to complex problems.

Table 1: Comparison Between G-NET, Smog And C4.5 With Respect To The Average Error Rate Of The Solution,
Evaluated With 10-fold Cross-validation

                            Average Error %                                  Average N. of Disjuncts
Problem     Dataset size    G-NET           Smog            C4.5             G-NET
monk1       432 10-fold     0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00       3.0
monk2       432 10-fold     2.80 ± 3.80     0.00 ± 0.00     32.83 ± 10.66    26.0
monk3       432 10-fold     0.00 ± 0.00     0.00 ± 0.00     0.00 ± 0.00       3.0
tictactoe   958 10-fold     0.97 ± 0.62     2.82 ± 1.97     7.07 ± 1.82      10.5
credit      690 10-fold     15.8 ± 4.40     19.57 ± 5.08    14.03 ± 3.28     14.0
breast      699 10-fold     5.29 ± 2.89     6.72 ± 2.44     5.85 ± 3.32       2.6
vote        435 10-fold     5.10 ± 3.20     5.29 ± 2.64     4.63 ± 3.05       2.0

Table 2: Comparison Between G-NET And REGAL With Respect To The Average Error Rate And The Complexity
Of The Solution

                                       Average Error %       N. of Disjuncts
Problem               Dataset size     G-NET     REGAL       G-NET     REGAL
splice-j. (EI)        2000 + 1190      3.40      4.40         7        19
splice-j. (IE)        2000 + 1190      2.90      4.20        10        26
splice-j. (Neither)   2000 + 1190      3.30      5.20        11        21
mushrooms             4000 + 4124      0.00      0.00         3         6

Finally, Table 3 reports the results of experiments
aimed at confirming that G-NET (like its predecessor
REGAL) is able to deal effectively with more complex
languages, such as predicate logic based ones.

The first row in Table 3 refers to the Mutagenesis
dataset, a challenging problem widely used in the ILP
community for testing induction algorithms in First
Order Logic (King et al., 1995). The problem consists
in learning rules for discriminating substances having
carcinogenic properties on the basis of their chemical
structure. The difficulty lies mainly in the complexity
of matching formulas in First Order Logic, which limits
the exploration capabilities of any induction system.
To our knowledge, the best results with this database
have been obtained by STILL (Sebag, 1997), a stochastic
induction algorithm that easily reaches error rates
below 10% and, with a careful setting of the control
parameters, made the best hit at 6.4%. Many other
systems, going from Linear Regression to PROGOL
and FOIL, reported error rates ranging between 11%
and 14%. G-NET, using only the predicates used in
(Sebag, 1997), obtained an error rate of 8.8%.

The second case study is a classification problem (Esposito
et al., 1992) of documents acquired through a
scanner and processed by an image processing program
that produces a structured description of the layout.
The dataset contains structured data described
with 5 symbolic and 3 numeric attributes, and has
been used to test learners with the capability of dealing
with numerical features in FOL (Esposito et al.,
1992; Botta and Giordana, 1993). G-NET does not
have, at the moment, any specific strategy for dealing
with numerical features, and so we transformed
the problem into a symbolic one by discretizing the
numeric features. Each numeric feature has been discretized
by subdividing the range into 16 equal length
intervals. G-NET easily reached an error rate below
1%, approximately the same as SMART+, which
has specific strategies for dealing with numerical features.
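
The discretization step can be illustrated with a short Python sketch of equal-width binning. It assumes that the range of each feature is given by known bounds `low` and `high` and that out-of-range values are clamped to the first or last interval; the function name is hypothetical.

```python
def equal_width_bin(value, low, high, n_bins=16):
    """Map a numeric value to the index of one of n_bins equal-length intervals."""
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / n_bins
    index = int((value - low) / width)
    # Clamp so that value == high (or an out-of-range value) stays in range.
    return min(max(index, 0), n_bins - 1)
```

Each numeric attribute is thus replaced by a symbolic one taking one of 16 values.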

Table 3: Experiments With First Order Problems. Error Rate For The Train Check-out 3 Is An Average Of 3
Runs

                                      Average Error %                               N. of Disjuncts
Problem             Dataset size      G-NET          STILL        SMART+/FONNs      G-NET
mutagenesis         230 10-fold       8.80 ± 7.90    6.4 ± 4.5    n.a.               3
office-doc          210 + 160         0.89 ± 0.72    n.a.         0.80              11
train check-out 3   500 + 6000        11.3 ± 0.47    n.a.         16.8               2

Finally, the last case study (Train Check-out 3) is a
difficult artificial dataset generated for testing FONNs
(Botta et al., 1997), a kind of neural network recently
proposed for refining numerical terms in Horn clauses.
The dataset contains the description of a set of trains,
similar to the one proposed by Michalski, where each
coach is described by means of a set of 5 symbolic and
4 numerical attributes. In (Botta et al., 1997) three
different learning problems of increasing difficulty, related
to this dataset, are presented. The problems consist
in learning sets of rules for assessing when a train
meets the safety conditions required for travelling on
a given line. The one we considered here is the most
difficult among them, and the challenge is to discover
the rule used for classifying the concept instances:


a train cannot go if it contains two near
cars, both without brakes and heavier
than a threshold $w_{\theta_1}$, or if it contains
two near cars carrying an unstable load
(special material) and heavier than a
threshold $w_{\theta_2} < w_{\theta_1}$.


FONNs could easily reach an error rate below 2% on a
test set of 6000 instances starting from a handcrafted
knowledge base, which correctly described the structure
of the rule hidden in the data, but only reached
an error rate of about 17% starting from a set of rules
learned by SMART+ from 500 learning instances. When
the problem was reshaped in propositional calculus, C4.5
and CART could not go below an error rate of 27%,
and neural networks such as the multi-layer perceptron
and cascade correlation performed even more
poorly (Botta et al., 1997).

G-NET has been run by discretizing every numeric
attribute into a range of 30 intervals. As the last row
of Table 3 shows, it was able to find two
clauses with an error rate of around 11%.


7  DISCUSSION 

As the results reported above show, G-NET
is a very flexible system, able to deal with many different
problems and to produce good results. Moreover, as
already stated, the results have been obtained without
performing any specific tuning, so that the system
proved to be quite robust and easy to use. This looks
surprising considering that a major complaint against
GAs is the difficulty of tuning parameters.

We  point  out  that,  in  spite  of  its  architecture  strongly 
resembling  a  Genetic  Algorithm,  G-NET  cannot  be 
considered  a  classical  GA,  because  the  principles  which 
control  the  evolution  are  substantially  different.  In  our 
opinion,  two  aspects  determine  the  success  of  G-NET: 
the  enforcement  of  diversity  in  the  local  populations 
and  the  coevolution. 

In their basic formulation GAs use genetic pressure,
i.e. the capability of the most fit individuals to reproduce
more quickly, so that the weakest ones are
eliminated from the population. This mechanism has
the positive effect of focusing the search on the most
fit individuals, so that, in the best case, the algorithm
will climb up a maximum of the fitness function. Unfortunately,
the mechanism is unstable, and too quick a
convergence prevents reaching optimal solutions. Another
drawback is that, in this way, many identical
individuals will be present in the population, so that
the search can become ineffective because the major
search operator (crossover) reproduces the same
individuals again and again.

A trend in the GA literature, which at least partially
relieves this problem, is related to the theory
of species and niches formation. Species formation
can be promoted in many ways by limiting the genetic
pressure between species (Goldberg and Richardson,
1987). Species formation offers some benefits,
such as the possibility of restricting crossover to the
individuals of the same species (crossover among different
species is essentially deceptive), increasing the
search effectiveness and allowing the discovery of multiple
modalities. For instance, in G-NET, as well as
in REGAL, this has been exploited for learning disjunctive
concept descriptions. However, even in this
framework, genetic pressure continues to be used inside
a same species as a mechanism of focus of attention.
Requiring that a population (the local memory
of a G-node) contains individuals (clauses) that are all
different is a definite departure from this mechanism, and
drastically limits any form of genetic pressure. Therefore,
the algorithm becomes much more stable and less
sensitive to the crossover type, and to the crossover and
mutation rates. Furthermore, in place of genetic pressure
other strategies, tailored to the specific task, can be
used for guiding the search. In our case, coevolution
is now the major strategy that focuses the search
where it is necessary instead of letting it follow the
stream enforced by the genetic pressure. A second
component is represented by the local search operators,
which are context sensitive and make the best
effort to increase the exploration capability of
the algorithm.

Both the idea of maintaining population diversity
and that of coevolution originated before G-NET,
whose originality consists in their adaptation to the
specific task and in the integration of these ideas into a
unique framework. On the one hand, diversity in GAs
has already been proposed by several authors (Augier
et al., 1995), although no one speculates on the reasons
why a GA should benefit from it. On the other
hand, diversity could be related to tabu search. The local
memory of a G-node works as an elementary tabu
list which prevents the algorithm from reprocessing
already generated instances without an explicit will to
do so.

Coevolution appeared inside the GA community several
years ago (Husbands and Mill, 1991), and has been
considered by few others since. The coevolution
model described here conforms to the one proposed
by (Potter et al., 1995), properly re-interpreted
in the framework of concept learning, which naturally
conforms to it.

Finally, the reassignment of the examples to be covered
to G-nodes, performed by the Supervisor, can be
considered a kind of boosting (Schapire, 1990): in subsequent
runs, the search efforts are concentrated
on those parts of the hypothesis space not yet adequately
covered. Currently, the series of found hypotheses
are combined into a unique formula, which
differentiates this approach from genuine boosting.
However, nothing hinders the Supervisor from keeping
the hypotheses apart and using them according to a
majority voting classification strategy, instead of combining
them. This possibility has not been explored
yet.

8  CONCLUSIONS 

In this paper we presented a new induction system
based on an evolutionary approach, which is the outcome
of several years of investigation in this direction.

Given the good results obtained across a variety of
datasets, languages, and evaluation criteria, it should
be evident that a system like G-NET can be profitably
used to explore the structure of new learning problems,
when little a priori information, clearly pointing to
another approach, is available.

Moreover, thanks to its computational model, G-NET
is able to effectively exploit parallel computing systems,
which allows it to deal with large and complex datasets.
As a matter of fact, in addition to the possibility of
distributing the search among many G-nodes, G-NET
also offers the possibility of distributing the evaluation
of hypotheses over several processors. Although this aspect
has not been described here, because it is outside the
scope of the paper, the current implementation of G-NET
runs on a cluster of workstations (Anglano et al.,
1997). This facility has been extensively exploited for
the experiments on the Mutagenesis and Splice Junctions
datasets, so that the results for every run have been
obtained in a few hours.

The conclusion is that G-NET seems to be very well
suited to learning structured concepts, such as the ones
typically learned by ILP methods, and, in addition, to
facing learning problems on large databases.
Abstract

In this paper we present TDLeaf(λ), a variation
on the TD(λ) algorithm that enables it to
be used in conjunction with game-tree search.
We present some experiments in which our
chess program “KnightCap” used TDLeaf(λ)
to learn its evaluation function while playing
on the Free Internet Chess Server (FICS,
fics.onenet.net). The main success we report
is that KnightCap improved from a 1650 rating
to a 2150 rating in just 308 games and 3 days
of play. As a reference, a rating of 1650 corresponds
to about level B human play (on a scale
from E (1000) to A (1800)), while 2150 is human
master level. We discuss some of the reasons for
this success, principal among them being the use
of on-line play rather than self-play.

1  Introduction 

Temporal Difference learning, first introduced by Samuel
[5] and later extended and formalized by Sutton [7] in his
TD(λ) algorithm, is an elegant technique for approximating
the expected long term future cost (or cost-to-go) of a
stochastic dynamical system as a function of the current
state. The mapping from states to future cost is implemented
by a parameterized function approximator such as
a neural network. The parameters are updated online after
each state transition, or possibly in batch updates after
several state transitions. The goal of the algorithm is to improve
the cost estimates as the number of observed state
transitions and associated costs increases.

Perhaps the most remarkable success of TD(λ) is Tesauro’s
TD-Gammon, a neural network backgammon player that
was trained from scratch using TD(λ) and simulated self-play.
TD-Gammon is competitive with the best human
backgammon players [9]. In TD-Gammon the neural network
played a dual role, both as a predictor of the expected
cost-to-go of the position and as a means to select moves.
In any position the next move was chosen greedily by evaluating
all positions reachable from the current state, and
then selecting the move leading to the position with smallest
expected cost. The parameters of the neural network
were updated according to the TD(λ) algorithm after each
game.

Although  the  results  with  backgammon  are  quite  striking, 
there  is  lingering  disappointment  that  despite  several  at¬ 
tempts,  they  have  not  been  repeated  for  other  board  games 
such  as  Othello,  Go  and  the  “drosophila  of  AI”  —  chess 
[10, 12,  6]. 

Many authors have discussed the peculiarities of backgammon
that make it particularly suitable for Temporal Difference
learning with self-play [8, 6, 4]. Principal among
these are speed of play: TD-Gammon learnt from several
hundred thousand games of self-play; representation
smoothness: the evaluation of a backgammon position
is a reasonably smooth function of the position (viewed,
say, as a vector of piece counts), making it easier to find
a good neural network approximation; and stochasticity:
backgammon is a random game which forces at least a minimal
amount of exploration of search space.

As TD-Gammon in its original form only searched one-ply
ahead, we feel this list should be extended with: shallow
search is good enough against humans. There are two
possible reasons for this: either one does not gain a lot
by searching deeper in backgammon (questionable given
that recent versions of TD-Gammon search to three-ply
and this significantly improves their performance), or humans
are simply incapable of searching deeply and so TD-Gammon
is only competing in a pool of shallow searchers.
Although we know of no psychological studies investigating
the depth to which humans search in backgammon, it
is plausible that the combination of high branching factor
and random move generation makes it quite difficult to
search more than one or two-ply ahead. In particular, random
move generation effectively prevents selective search
or “forward pruning” because it enforces a lower bound on
the branching factor at each move.

In contrast, finding a representation for chess, Othello or
Go which allows a small neural network to order moves at
one-ply with near human performance is a far more difficult
task. It seems that for these games, reliable tactical
evaluation is difficult to achieve without deep lookahead.
As deep lookahead invariably involves some kind of minimax
search, which in turn requires an exponential increase
in the number of positions evaluated as the search depth
increases, the computational cost of the evaluation function
has to be low, ruling out the use of expensive evaluation
functions such as neural networks. Consequently most
chess and Othello programs use linear evaluation functions
(the branching factor in Go makes minimax search to any
significant depth nearly infeasible).

In this paper we introduce TDLeaf(λ), a variation on the
TD(λ) algorithm that can be used to learn an evaluation
function for use in deep minimax search. TDLeaf(λ) is
identical to TD(λ) except that instead of operating on the
positions that occur during the game, it operates on the leaf
nodes of the principal variation of a minimax search from
each position (also known as the principal leaves).

To test the effectiveness of TDLeaf(λ), we incorporated it
into our own chess program, KnightCap. KnightCap has
a particularly rich board representation enabling relatively
fast computation of sophisticated positional features, although
this is achieved at some cost in speed (KnightCap is
about 10 times slower than Crafty, the best public-domain
chess program, and 6,000 times slower than Deep Blue).
We trained KnightCap’s linear evaluation function using
TDLeaf(λ) by playing it on the Free Internet Chess Server
(FICS, fics.onenet.net) and on the Internet Chess
Club (ICC, chessclub.com). Internet play was used
to avoid the premature convergence difficulties associated
with self-play¹. The main success story we report is that, starting
from an evaluation function in which all coefficients were
set to zero except the values of the pieces, KnightCap went
from a 1650-rated player to a 2150-rated player in just three
days and 308 games. KnightCap is an ongoing project with
new features being added to its evaluation function all the
time. We use TDLeaf(λ) and Internet play to tune the coefficients
of these features.


¹Randomizing move choice is another way of avoiding problems
associated with self-play (this approach has been tried in Go
[6]), but the advantage of the Internet is that more information is
provided by the opponents’ play.


The remainder of this paper is organized as follows. In
section 2 we describe the TD(λ) algorithm as it applies to
games. The TDLeaf(λ) algorithm is described in section 3.
Experimental results for internet-play with KnightCap are
given in section 4. Section 5 contains some discussion and
concluding remarks.

2 The TD(λ) algorithm applied to games

In this section we describe the TD(λ) algorithm as it applies
to playing board games. We discuss the algorithm from the
point of view of an agent playing the game.

Let $S$ denote the set of all possible board positions in the
game. Play proceeds in a series of moves at discrete time
steps $t = 1, 2, \ldots$. At time $t$ the agent finds itself in
some position $x_t \in S$, and has available a set of moves,
or actions, $A_{x_t}$ (the legal moves in position $x_t$). The agent
chooses an action $a \in A_{x_t}$, and makes a transition to state
$x_{t+1}$ with probability $p(x_t, x_{t+1}, a)$. Here $x_{t+1}$ is the
position of the board after the agent’s move and the opponent’s
response. When the game is over, the agent receives
a scalar reward, typically “1” for a win, “0” for a draw and
“-1” for a loss.

For ease of notation we will assume all games have a fixed
length of $N$ (this is not essential). Let $r(x_N)$ denote the
reward received at the end of the game. If we assume that the
agent chooses its actions according to some function $a(x)$
of the current state $x$ (so that $a(x) \in A_x$), the expected
reward from each state $x \in S$ is given by

$J^*(x) := E_{x_N | x} \, r(x_N),$    (1)

where the expectation is with respect to the transition
probabilities $p(x_t, x_{t+1}, a(x_t))$ and possibly also with respect
to the actions $a(x_t)$ if the agent chooses its actions
stochastically.

For very large state spaces $S$ it is not possible to store the
value of $J^*(x)$ for every $x \in S$, so instead we might try
to approximate $J^*$ using a parameterized function class
$J : S \times \mathbb{R}^k \to \mathbb{R}$, for example linear functions, splines,
neural networks, etc. $J(\cdot, w)$ is assumed to be a differentiable
function of its parameters $w = (w_1, \ldots, w_k)$. The aim is to
find a parameter vector $w \in \mathbb{R}^k$ that minimizes some measure
of error between the approximation $J(\cdot, w)$ and $J^*$.
The TD(λ) algorithm, which we describe now, is designed
to do exactly that.

Suppose $x_1, \ldots, x_{N-1}, x_N$ is a sequence of states in one
game. For a given parameter vector $w$, define the temporal
difference associated with the transition $x_t \to x_{t+1}$ by

$d_t := J(x_{t+1}, w) - J(x_t, w).$    (2)
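
The temporal differences of equation (2), and the end-of-game update of equation (3) given later in this section, can be sketched in Python. The sketch assumes a linear evaluation function $J(x, w) = w \cdot \phi(x)$, so that the gradient with respect to $w$ is the feature vector $\phi(x)$; the feature vectors, the function names and the list-based representation are illustrative assumptions, not part of the paper. Here $\alpha$ is the learning rate and $\lambda \in [0, 1]$ weights later temporal differences.

```python
def temporal_differences(values, reward):
    """values[t] holds J(x_{t+1}, w) for the non-terminal states x_1..x_{N-1}.

    The terminal prediction J(x_N, w) is taken to equal the reward r(x_N).
    """
    targets = values[1:] + [reward]
    return [nxt - cur for cur, nxt in zip(values, targets)]


def td_lambda_update(w, features, reward, alpha, lam):
    """One end-of-game TD(lambda) update for a linear evaluation function.

    features[t] is the (assumed) feature vector of the t-th non-terminal
    state; for linear J it is also the gradient of J with respect to w.
    """
    values = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
    d = temporal_differences(values, reward)
    new_w = list(w)
    discounted = 0.0
    for t in reversed(range(len(features))):
        # discounted == sum over j >= t of lam**(j - t) * d[j]
        discounted = d[t] + lam * discounted
        for i, fi in enumerate(features[t]):
            new_w[i] += alpha * fi * discounted
    return new_w
```

Accumulating the inner sum backwards avoids recomputing it for every $t$, which keeps the update linear in the length of the game.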
Note that $d_t$ measures the difference between the reward
predicted by $J(\cdot, w)$ at time $t+1$, and the reward predicted
by $J(\cdot, w)$ at time $t$. The true evaluation function $J^*$ has
the property

$E_{x_{t+1} | x_t} \left[ J^*(x_{t+1}) - J^*(x_t) \right] = 0,$

so if $J(\cdot, w)$ is a good approximation to $J^*$, $E_{x_{t+1} | x_t} d_t$
should be close to zero. For ease of notation we will assume
that $J(x_N, w) = r(x_N)$ always, so that the final temporal
difference satisfies

$d_{N-1} = J(x_N, w) - J(x_{N-1}, w) = r(x_N) - J(x_{N-1}, w).$

That is, $d_{N-1}$ is the difference between the true outcome
of the game and the prediction at the penultimate move.

At the end of the game, the TD(λ) algorithm updates the
parameter vector $w$ according to the formula

$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right]$    (3)

where $\nabla J(\cdot, w)$ is the vector of partial derivatives of $J$ with
respect to its parameters. The positive parameter $\alpha$ controls
the learning rate and would typically be “annealed”
towards zero during the course of a long series of games.
The parameter $\lambda \in [0, 1]$ controls the extent to which temporal
differences propagate backwards in time. To see this,
compare equation (3) for $\lambda = 0$:

$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \, d_t$
$\phantom{w :}= w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ J(x_{t+1}, w) - J(x_t, w) \right]$    (4)

and $\lambda = 1$:

$w := w + \alpha \sum_{t=1}^{N-1} \nabla J(x_t, w) \left[ r(x_N) - J(x_t, w) \right].$    (5)

Equation (5) follows because for $\lambda = 1$ the inner sum in (3)
telescopes: $\sum_{j=t}^{N-1} d_j = J(x_N, w) - J(x_t, w) = r(x_N) - J(x_t, w)$.

Consider each term contributing to the sums in equations
(4) and (5). For $\lambda = 0$ the parameter vector is being adjusted
in such a way as to move $J(x_t, w)$, the predicted
reward at time $t$, closer to $J(x_{t+1}, w)$, the predicted reward
at time $t+1$. In contrast, TD(1) adjusts the parameter
vector in such a way as to move the predicted reward at time
step $t$ closer to the final reward at time step $N$. Values of
$\lambda$ between zero and one interpolate between these two behaviors.
Note that (5) is equivalent to gradient descent on
the error function $E(w) := \sum_{t=1}^{N-1} \left[ r(x_N) - J(x_t, w) \right]^2$.

Successive parameter updates according to the TD(λ) algorithm
should, over time, lead to improved predictions of
the expected reward $J(\cdot, w)$. Provided the actions $a(x_t)$
are independent of the parameter vector $w$, it can be shown
that for linear $J(\cdot, w)$, the TD(λ) algorithm converges to a
near-optimal parameter vector [11]. Unfortunately, there is
no such guarantee if $J(\cdot, w)$ is non-linear [11], or if $a(x_t)$
depends on $w$ [2].

3 Minimax Search and TD(λ)

For argument’s sake, assume any action $a$ taken in state $x$
leads to a predetermined state which we will denote by $x'_a$.
Once an approximation $J(\cdot, w)$ to $J^*$ has been found, we
can use it to choose actions in state $x$ by picking the action
$a \in A_x$ whose successor state $x'_a$ minimizes the opponent’s
expected reward²:

$a^*(x) := \arg\min_{a \in A_x} J(x'_a, w).$    (6)

This was the strategy used in TD-Gammon. Unfortunately,
for games like Othello and chess it is very difficult to accurately
evaluate a position by looking only one move or
ply ahead. Most programs for these games employ some
form of minimax search. In minimax search, one builds
a tree from position $x$ by examining all possible moves
for the computer in that position, then all possible moves
for the opponent, and then all possible moves for the computer
and so on to some predetermined depth $d$. The leaf
nodes of the tree are then evaluated using a heuristic evaluation
function (such as $J(\cdot, w)$), and the resulting scores
are propagated back up the tree by choosing at each stage
the move which leads to the best position for the player on
the move. See figure 1 for an example game tree and its
minimax evaluation. With reference to the figure, note that
the evaluation assigned to the root node is the evaluation
of the leaf node of the principal variation: the sequence of
moves taken from the root to the leaf if each side chooses
the best available move.

In practice many engineering tricks are used to improve the
performance of the minimax algorithm, α-β search being
the most famous.
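The propagation rule described above can be sketched in a few lines. This is a plain, full-breadth minimax without α-β pruning; the function name `minimax_leaf`, the tree, and the leaf scores are made up for illustration and are not taken from figure 1 or from KnightCap. Besides the root value, the sketch returns the leaf at the end of the principal variation, since that leaf is where the root's score comes from.

```python
def minimax_leaf(state, depth, maximizing, children, evaluate):
    """Return (value, leaf): the minimax value of `state` and the leaf
    of one principal variation that produced it."""
    kids = children(state) if depth > 0 else []
    if not kids:
        # Leaf of the search tree: score it with the evaluation function.
        return evaluate(state), state
    results = [
        minimax_leaf(k, depth - 1, not maximizing, children, evaluate)
        for k in kids
    ]
    # The player to move picks the best child; on ties the first is kept.
    pick = max if maximizing else min
    return pick(results, key=lambda r: r[0])


# Hypothetical 3-ply tree: A is our move, B and C are the opponent's.
TREE = {
    "A": ["B", "C"],
    "B": ["D", "E"], "C": ["F", "G"],
    "D": ["H", "I"], "E": ["J", "K"], "F": ["L", "M"], "G": ["N", "O"],
}
# Hypothetical leaf scores standing in for the evaluation function.
SCORES = {"H": 3, "I": -9, "J": -5, "K": -6, "L": 4, "M": 2, "N": 5, "O": -1}

value, leaf = minimax_leaf(
    "A", 3, True, lambda s: TREE.get(s, []), lambda s: SCORES[s]
)
print(value, leaf)
```

With these scores the root value equals the score of leaf L, so the root's evaluation depends on the parameters only through that one leaf.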

Figure 1: Full breadth, 3-ply search tree illustrating the
minimax rule for propagating values. Each of the leaf
nodes (H-O) is given a score by the evaluation function,
$\tilde{J}(\cdot, w)$. These scores are then propagated back up the tree
by assigning to each opponent's internal node the minimum
of its children's values, and to each of our internal nodes the
maximum of its children's values. The principal variation
is then the sequence of best moves for either side starting
from the root node, and this is illustrated by a dashed line
in the figure. Note that the score at the root node A is the
evaluation of the leaf node (L) of the principal variation. As
there are no ties between any siblings, the derivative of A's
score with respect to the parameters $w$ is just $\nabla \tilde{J}(L, w)$.

Footnote 2: If successor states are only determined stochastically by
the choice of $a$, we would choose the action minimizing the expected
reward over the choice of successor states.

Let $\tilde{J}_d(x, w)$ denote the evaluation obtained for state $x$ by
applying $\tilde{J}(\cdot, w)$ to the leaf nodes of a depth $d$ minimax
search from $x$. Our aim is to find a parameter vector $w$
such that $\tilde{J}_d(\cdot, w)$ is a good approximation to the expected
reward $J^*$. One way to achieve this is to apply the TD(λ)
algorithm to $\tilde{J}_d(x, w)$. That is, for each sequence of positions
$x_1, \ldots, x_N$ in a game we define the temporal differences

$$d_t := \tilde{J}_d(x_{t+1}, w) - \tilde{J}_d(x_t, w) \tag{7}$$

as per equation (2), and then the TD(λ) algorithm (3) for
updating the parameter vector $w$ becomes

$$w := w + \alpha \sum_{t=1}^{N-1} \nabla \tilde{J}_d(x_t, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right]. \tag{8}$$


One problem with equation (8) is that for $d > 1$, $\tilde{J}_d(x, w)$
is not necessarily a differentiable function of $w$ for all
values of $w$, even if $\tilde{J}(\cdot, w)$ is everywhere differentiable.
This is because for some values of $w$ there will be "ties" in
the minimax search, i.e. there will be more than one best
move available in some of the positions along the principal
variation, which means that the principal variation will not
be unique (see figure 2). Thus, the evaluation assigned to
the root node, $\tilde{J}_d(x, w)$, will be the evaluation of any one
of a number of leaf nodes.

Figure 2: A search tree with a non-unique principal variation
(PV). In this case the derivative of the root node A with
respect to the parameters of the leaf-node evaluation function
is multi-valued, either $\nabla \tilde{J}(H, w)$ or $\nabla \tilde{J}(L, w)$.
Except for transpositions (in which case H and L are identical
and the derivative is single-valued anyway), such "collisions"
are likely to be extremely rare, so in TDLeaf(λ) we
ignore them by choosing a leaf node arbitrarily from the
available candidates.

Fortunately, under some mild technical assumptions on the
behavior of $\tilde{J}(x, w)$, it can be shown that for each state $x$,
the set of $w \in \mathbb{R}^k$ for which $\tilde{J}_d(x, w)$ is not differentiable
has Lebesgue measure zero. Thus for all states $x$ and for
"almost all" $w \in \mathbb{R}^k$, $\tilde{J}_d(x, w)$ is a differentiable function
of $w$. Note that $\tilde{J}_d(x, w)$ is also a continuous function of
$w$ whenever $\tilde{J}(x, w)$ is a continuous function of $w$. This
implies that even for the "bad" pairs $(x, w)$, $\nabla \tilde{J}_d(x, w)$ is
only undefined because it is multi-valued. Thus we can
still arbitrarily choose a particular value for $\nabla \tilde{J}_d(x, w)$ if
$w$ happens to land on one of the bad points.

Based on these observations we modified the TD(λ)
algorithm to take account of minimax search in an almost
trivial way: instead of working with the root positions
$x_1, \ldots, x_N$, the TD(λ) algorithm is applied to the leaf
positions found by minimax search from the root positions.
We call this algorithm TDLeaf(λ). Full details are given in
figure 3.


4 TDLeaf(λ) and Chess


In this section we describe the outcome of several
experiments in which the TDLeaf(λ) algorithm was used
to train the weights of a linear evaluation function in
our chess program "KnightCap". KnightCap is a reasonably
sophisticated computer chess program for Unix systems.
It has all the standard algorithmic features that
modern chess programs tend to have as well as a number
of features that are much less common. For more
details on KnightCap, including the source code, see
wwwsyseng.anu.edu.au/lsg.

Let $\tilde{J}(\cdot, w)$ be a class of evaluation functions parameterized by
$w \in \mathbb{R}^k$. Let $x_1, \ldots, x_N$ be $N$ positions that occurred during
the course of a game, with $r(x_N)$ the outcome of the game. For
notational convenience set $\tilde{J}(x_N, w) := r(x_N)$.

1. For each state $x_i$, compute $\tilde{J}_d(x_i, w)$ by performing minimax
   search to depth $d$ from $x_i$ and using $\tilde{J}(\cdot, w)$ to score the
   leaf nodes. Note that $d$ may vary from position to position.

2. Let $x_i^l$ denote the leaf node of the principal variation starting
   at $x_i$. If there is more than one principal variation, choose a
   leaf node from the available candidates at random. Note that

   $$\tilde{J}_d(x_i, w) = \tilde{J}(x_i^l, w). \tag{9}$$

3. For $t = 1, \ldots, N - 1$, compute the temporal differences:

   $$d_t := \tilde{J}(x_{t+1}^l, w) - \tilde{J}(x_t^l, w). \tag{10}$$

4. Update $w$ according to the TDLeaf(λ) formula:

   $$w := w + \alpha \sum_{t=1}^{N-1} \nabla \tilde{J}(x_t^l, w) \left[ \sum_{j=t}^{N-1} \lambda^{j-t} d_j \right]. \tag{11}$$

Figure 3: The TDLeaf(λ) algorithm
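Steps 3 and 4 of figure 3 can be sketched for the special case of a linear evaluation function, where the evaluation is a dot product of the weights with a feature vector and the gradient is the feature vector itself. This is a minimal sketch under that assumption, not KnightCap's implementation: the name `tdleaf_update`, the list-of-features representation of the principal-variation leaves, and the toy numbers are all hypothetical.

```python
def tdleaf_update(w, leaf_features, outcome, alpha, lam):
    """One end-of-game TDLeaf(lambda) step for a linear evaluation.

    leaf_features[t] is the feature vector of the principal-variation
    leaf found from position t (positions 1..N-1 of the game); the final
    position is scored by `outcome`, i.e. J(x_N, w) := r(x_N).
    """
    # Evaluate every stored leaf, then append the game outcome.
    values = [sum(wi * fi for wi, fi in zip(w, phi)) for phi in leaf_features]
    values.append(outcome)
    # Temporal differences between successive leaf evaluations.
    d = [values[t + 1] - values[t] for t in range(len(leaf_features))]

    new_w = list(w)
    acc = 0.0
    # Walk backwards so that acc = sum_{j>=t} lam^(j-t) * d[j].
    for t in reversed(range(len(d))):
        acc = d[t] + lam * acc
        for i, fi in enumerate(leaf_features[t]):
            # For a linear evaluation the gradient is the feature vector.
            new_w[i] += alpha * fi * acc
    return new_w


# Toy game: two stored leaves, two features, and a win (outcome 1.0).
feats = [[1.0, 0.0], [0.0, 1.0]]
print(tdleaf_update([0.0, 0.0], feats, 1.0, alpha=0.5, lam=1.0))
print(tdleaf_update([0.0, 0.0], feats, 1.0, alpha=0.5, lam=0.0))
```

In the toy game, λ = 1 moves both predictions toward the final outcome, while λ = 0 changes only the weight active in the last stored leaf, because the first temporal difference is zero.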


4.1  Experiments  with  KnightCap 

In our main experiment we took KnightCap's evaluation
function and set all but the material parameters to zero.
The material parameters were initialized to the standard
"computer" values: 1 for a pawn, 4 for a knight, 4 for a
bishop, 6 for a rook and 12 for a queen. With these
parameter settings KnightCap (under the pseudonym
"WimpKnight") was started on the Free Internet Chess Server
(FICS, fics.onenet.net) against both human and
computer opponents. We played KnightCap for 25 games
without modifying its evaluation function so as to get a
reasonable idea of its rating. After 25 games it had a blitz (fast
time control) rating of 1650 ± 50 (footnote 3), which put it at about
B-grade human performance (on a scale from E (1000) to
A (1800)), although of course the kind of game KnightCap
plays with just material parameters set is very different to
human play of the same level (KnightCap makes no short-term
tactical errors but is positionally completely ignorant).
We then turned on the TDLeaf(λ) learning algorithm, with
λ = 0.7 and the learning rate α = 1.0. The value of λ was
chosen heuristically, based on the typical delay in moves
before an error takes effect, while α was set high enough
to ensure rapid modification of the parameters. A couple of
minor modifications to the algorithm were made:


Footnote 3: The standard deviation for all ratings reported in this
section is about 50.


•  The raw (linear) leaf node evaluations $\tilde{J}(x_t^l, w)$ were
converted to a score between -1 and 1 by computing

$$v_t^l := \tanh\left[ \beta \tilde{J}(x_t^l, w) \right].$$

This ensured small fluctuations in the relative values
of leaf nodes did not produce large temporal differences
(the values $v_t^l$ were used in place of $\tilde{J}(x_t^l, w)$
in the TDLeaf(λ) calculations). The outcome of the
game $r(x_N)$ was set to 1 for a win, -1 for a loss
and 0 for a draw. $\beta$ was set to ensure that a value
of $\tanh[\beta \tilde{J}(x_t^l, w)] = 0.25$ was equivalent to a
material superiority of 1 pawn (initially).

•  The temporal differences, $d_t = v_{t+1}^l - v_t^l$, were
modified in the following way. Negative values of $d_t$
were left unchanged as any decrease in the evaluation
from one position to the next can be viewed as a
mistake. However, positive values of $d_t$ can occur
simply because the opponent has made a blunder. To
avoid KnightCap trying to learn to predict its
opponent's blunders, we set all positive temporal
differences to zero unless KnightCap predicted the
opponent's move (footnote 4).

Footnote 4: In a later experiment we only set positive temporal
differences to zero if KnightCap did not predict the opponent's move
and the opponent was rated less than KnightCap. After all,
predicting a stronger opponent's blunders is a useful skill, although
whether this made any difference is not clear.


KnightCap:  A  learning  chess  program  33 


•  The value of a pawn was kept fixed at its initial value
so as to allow easy interpretation of weight values
as multiples of the pawn value (we actually experimented
with not fixing the pawn value and found it
made little difference: after 1764 games with an
adjustable pawn its value had fallen by less than 7
percent).
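The first two modifications above can be sketched as follows. The sketch assumes that the scaling constant is obtained by solving tanh(β · pawn) = 0.25 for β, which is one way to satisfy the stated calibration; the function names, the `predicted_reply` flags, and the toy evaluations are hypothetical and are not taken from KnightCap's source.

```python
import math


def beta_for_pawn(pawn_value):
    """Scale chosen so that tanh(beta * pawn_value) equals 0.25."""
    return math.atanh(0.25) / pawn_value


def squashed_temporal_differences(leaf_evals, outcome, beta, predicted_reply):
    """Temporal differences on tanh-squashed leaf evaluations.

    leaf_evals[t]      raw linear evaluation of the PV leaf for position t
    outcome            1 for a win, -1 for a loss, 0 for a draw
    predicted_reply[t] True if the opponent's move leading from position
                       t to position t+1 was the one the program predicted
    """
    v = [math.tanh(beta * j) for j in leaf_evals] + [float(outcome)]
    diffs = []
    for t in range(len(leaf_evals)):
        d = v[t + 1] - v[t]
        # A positive difference after an unpredicted reply is treated as
        # an opponent blunder and ignored; negative ones are always kept.
        if d > 0 and not predicted_reply[t]:
            d = 0.0
        diffs.append(d)
    return diffs


beta = beta_for_pawn(1.0)
diffs = squashed_temporal_differences(
    [0.0, 1.0, -1.0], 1, beta, [False, True, True]
)
print(diffs)
```

In this toy sequence the first difference is positive but follows an unpredicted reply, so it is zeroed; the second is negative and kept; the third is positive and kept because the reply was predicted.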

Within 300 games KnightCap's rating had risen to 2150, an
increase of 500 points in three days, and to a level comparable
with human masters. At this point KnightCap's performance
began to plateau, primarily because it does not
have an opening book and so will repeatedly play into weak
lines. We have since implemented an opening book learning
algorithm and with this KnightCap now plays at a rating
of 2400-2500 (peak 2575) on the other major internet chess
server: ICC, chessclub.com (footnote 5). It often beats
International Masters at blitz. Also, because KnightCap
automatically learns its parameters we have been able to add a large
number of new features to its evaluation function: KnightCap
currently operates with 5872 features (1468 features
in four stages: opening, middle, ending and mating (footnote 6)). With
this extra evaluation power KnightCap easily beats versions
of Crafty restricted to search only as deep as itself.
However, a big caveat to all this optimistic assessment is
that KnightCap routinely gets crushed by faster programs
searching more deeply. It is quite unlikely this can be easily
fixed simply by modifying the evaluation function, since
for this to work one has to be able to predict tactics
statically, something that seems very difficult to do. If one
could find an effective algorithm for "learning to search
selectively" there would be potential for far greater
improvement.

Footnote 5: There appears to be a systematic difference of around
200-250 points between the two servers, so a peak rating of 2575 on
ICC roughly corresponds to a peak of 2350 on FICS. We transferred
KnightCap to ICC because there are more strong players
playing there.

Footnote 6: In reality there are not 1468 independent "concepts" per
stage in KnightCap's evaluation function as many of the features come
in groups of 64, one for each square on the board (like the value
of placing a rook on a particular square, for example).

Note that we have twice repeated the learning experiment
and found a similar rate of improvement and final performance
level. The rating as a function of the number of
games from one of these repeat runs is shown in figure 4
(we did not record this information in the first experiment).
Note that in this case KnightCap took nearly twice as long
to reach the 2150 mark, but this was partly because it was
operating with limited memory (8Mb) until game 500 at
which point the memory was increased to 40Mb (KnightCap's
search algorithm, MTD(f) [3], is a memory-intensive
variant of α-β, and when learning KnightCap must
store the whole position in the hash table, so small memory
really hurts the performance). Another reason may also
have been that for a portion of the run we were performing
parameter updates after every four games rather than every
game.

Figure 4: KnightCap's rating as a function of games played
(second experiment). Learning was turned on at game 0.

Plots of various parameters as a function of the number of
games played are shown in figure 5 (these plots are from
the same experiment as figure 4). Each plot contains three
graphs corresponding to the three different stages of the
evaluation function: opening, middle and ending (footnote 7).

Finally, we compared the performance of KnightCap with
its learnt weights to KnightCap's performance with a set of
hand-coded weights, again by playing the two versions on
ICC. The hand-coded weights were close in performance
to the learnt weights (perhaps 50-100 rating points worse).
We also tested the result of allowing KnightCap to learn
starting from the hand-coded weights, and in this case it
seems that KnightCap performs better than when starting
from just material values (peak performance was 2632
compared to 2575, but these figures are very noisy). We are
conducting more tests to verify these results. However, it
should not be too surprising that learning from a good-quality
set of hand-crafted parameters is better than just learning
from material parameters. In particular, some of the
hand-crafted parameters have very high values (the value of
an "unstoppable pawn", for example) which can take a very
long time to learn under normal playing conditions, particularly
if they are rarely active in the principal leaves. It is


Footnote 7: KnightCap actually has a fourth and final stage, "mating",
which kicks in when all the pieces are off, but this stage only uses
a few of the coefficients (opponent's king mobility and proximity
of our king to the opponent's king).


34  Baxter,  Tridgell,  and  Weaver 


[Figure 5 plots not reproduced; surviving panel title: DOUBLED.PAWN]

Figure 5: Evolution of two parameters (bonus for castling
and penalty for a doubled pawn) as a function of the number
of games played. Note that each parameter appears
three times: once for each of the three stages in the
evaluation function.

not  yet  clear  whether  given  a  sufficient  number  of  games 
this  dependence  on  the  initial  conditions  can  be  made  to 
vanish. 

4.2  Discussion 

There  appear  to  be  a  number  of  reasons  for  the  remarkable 
rate  at  which  KnightCap  improved. 

1.  As all the non-material weights were initially zero,
even small changes in these weights could cause very
large changes in the relative ordering of materially
equal positions. Hence even after a few games KnightCap
was playing a substantially better game of chess.

2.  It seems to be important that KnightCap started out
life with intelligent material parameters. This put it
close in parameter space to many far superior parameter
settings.

3.  Most players on FICS prefer to play opponents of similar
strength, and so KnightCap's opponents improved
as it did. This may have had the effect of guiding
KnightCap along a path in weight space that led to
a strong set of weights.

4.  KnightCap was learning on-line, not by self-play. The
advantage of on-line play is that there is a great deal
of information provided by the opponent's moves. In
particular, against a stronger opponent KnightCap was
being shown positions that 1) could be forced (against
KnightCap's weak play) and 2) were mis-evaluated by
its evaluation function. Of course, in self-play KnightCap
can also discover positions which are mis-evaluated,
but it will not find the kinds of positions that
are relevant to strong play against other opponents. In
this setting, one can view the information provided by
the opponent's moves as partially solving the "exploration"
part of the exploration/exploitation tradeoff.

To  further  investigate  the  importance  of  some  of  these 
reasons,  we  conducted  several  more  experiments. 

Good  initial  conditions. 

A second experiment was run in which KnightCap's
coefficients were all initialised to the value of a pawn. The
value of a pawn needs to be positive in KnightCap because
it is used in many other places in the code: for
example we deem the MTD search to have converged if
$\alpha < \beta + 0.07 \times \text{PAWN}$. Thus, to set all parameters equal to
the same value, that value had to be a pawn.

Playing with the initial weight settings KnightCap had a
blitz rating of around 1250. After more than 1000 games
on FICS KnightCap's rating has improved to about 1550,
a 300 point gain. This is a much slower improvement
than the original experiment. We do not know whether
the coefficients would have eventually converged to good
values, but it is clear from this experiment that starting
near to a good set of weights is important for fast
convergence. An interesting avenue for further exploration
here is the effect of λ on the learning rate. Because the
initial evaluation function is completely wrong, there
would be some justification in setting λ = 1 early on so
that KnightCap only tries to predict the outcome of the
game and not the evaluations of later moves (which are
extremely unreliable).

Self-Play 

Learning by self-play was extremely effective for TD-Gammon,
but a significant reason for this is the randomness
of backgammon, which ensures that with high probability
different games have substantially different sequences
of moves, and also the speed of play of TD-Gammon,
which ensured that learning could take place over several
hundred-thousand games. Unfortunately, chess programs
are slow, and chess is a deterministic game, so self-play by
a deterministic algorithm tends to result in a large number
of substantially similar games. This is not a problem if the
games seen in self-play are "representative" of the games
played in practice; however, KnightCap's self-play games
with only non-zero material weights are very different to
the kind of games humans of the same level would play.

To demonstrate that learning by self-play for KnightCap is
not as effective as learning against real opponents, we ran
another experiment in which all but the material parameters
were initialised to zero again, but this time KnightCap
learnt by playing against itself. After 600 games (twice as
many as in the original FICS experiment), we played the
resulting version against the good version that learnt on FICS
for a further 100 games with the weight values fixed. The
self-play version scored only 11% against the good FICS
version.

Simultaneously with the work presented here, Beal
and Smith [1] reported positive results using essentially
TDLeaf(λ) and self-play (with some random move choice)
when learning the parameters of an evaluation function that
only computed material balance. However, they were not
comparing performance against on-line players, but were
primarily investigating whether the weights would converge
to "sensible" values at least as good as the naive (1, 3,
3, 5, 9) values for (pawn, knight, bishop, rook, queen) (they
did, within 2000 games, and using a value of λ = 0.95,
which supports the discussion in "Good initial conditions"
above).

5  Conclusion 

We have introduced TDLeaf(λ), a variant of TD(λ) suitable
for training an evaluation function used in minimax search.
The only extra requirement of the algorithm is that the leaf
nodes of the principal variations be stored throughout the
game.

We presented some experiments in which a chess evaluation
function was trained from B-grade to master level using
TDLeaf(λ) by on-line play against a mixture of human
and computer opponents. The experiments show both the
importance of "on-line" sampling (as opposed to self-play)
for a deterministic game such as chess, and the need to
start near a good solution for fast convergence, although
just how near is still not clear.


On the theoretical side, it has recently been shown that
TD(λ) converges for linear evaluation functions [11]
(although only in the sense of prediction, not control). An
interesting avenue for further investigation would be to
determine whether TDLeaf(λ) has similar convergence
properties.

Abstract

Combining multiple classifiers is an effective
technique for improving accuracy. There are
many general combining algorithms, such as
Bagging or Error Correcting Output Coding,
that significantly improve classifiers like decision
trees, rule learners, or neural networks.
Unfortunately, many combining methods do
not improve the nearest neighbor classifier.
In this paper, we present MFS, a combining
algorithm designed to improve the accuracy
of the nearest neighbor (NN) classifier. MFS
combines multiple NN classifiers, each using
only a random subset of features. The experimental
results are encouraging: on 25
datasets from the UCI Repository, MFS significantly
improved upon the NN, k nearest
neighbor (kNN), and NN classifiers with
forward and backward selection of features.
MFS was also robust to corruption by irrelevant
features compared to the kNN classifier.
Finally, we show that MFS is able to reduce
both bias and variance components of error.


1  INTRODUCTION 

The nearest neighbor (NN) classifier is one of the oldest
and simplest methods for performing general, nonparametric
classification. It can be represented by the
following rule: to classify an unknown pattern, choose
the class of the nearest example in the training set as
measured by a distance metric. A common extension
is to choose the most common class in the k nearest
neighbors (kNN).

*  Research performed while at the University of Waterloo,
Department of Systems Design Engineering, Waterloo,
Ont., N2L 3G1, Canada.

Despite its simplicity, the NN classifier has many advantages
over other methods. For example, it can learn
from a small set of examples, can incrementally add
new information at runtime, and often gives competitive
performance with more modern methods such as
decision trees or neural networks.

Since its inception by Fix and Hodges (1951), researchers
have investigated many methods for improving
the NN classifier, but most work has concentrated
on changing the distance metric or manipulating
the patterns in the training set (Dasarathy, 1991).
Recently, researchers have begun experimenting with
general algorithms for improving classification accuracy
by combining multiple versions of a single classifier,
also known as a multiple model or ensemble approach.
The outputs of several classifiers are combined
in the hope that the accuracy of the whole is greater
than the parts. Unfortunately, many combining methods
do not improve the NN classifier at all.

For example, in Breiman's (1996) experiments with
Bagging, he found no difference in accuracy between
the bagged NN classifier and the single model approach.
His results suggest that other combining
methods that involve any significant degree of resampling
or replication of patterns will not work with the
NN classifier. Kong and Dietterich (1996) also concluded
that Error Correcting Output Coding (ECOC),
a method of combining classifiers by decomposing
multi-class problems into multiple two-class problems,
will not improve classifiers that use local information
because of high error correlation. For example, with
the NN classifier we predict the class of the closest pattern.
This pattern is the same in all of the two-class
problems, and hence if it gives an incorrect prediction,
all the predictions in the ECOC ensemble will be
incorrect (footnote 1).

In this paper, we present a new method of combining
nearest neighbor classifiers with the goal of improving
classification accuracy. Our approach manipulates
the features that the individual classifiers use. In contrast,
other combining algorithms may manipulate the
training patterns (Bagging, Boosting) or the class labels
(ECOC).

In  the  next  section,  we  describe  the  MFS  algorithm 
for  combining  multiple  NN  classifiers.  In  Section  3, 
we  evaluate  the  algorithm  on  datasets  from  the  UCI 
Repository  for  accuracy,  computational  complexity, 
and  robustness  to  irrelevant  features.  In  Section  4,  we 
analyze  the  algorithm’s  bias  and  variance  components 
of  error.  In  Section  5,  we  discuss  related  work,  and 
follow  it  by  conclusions  and  future  work  in  Section  6. 

2 CLASSIFICATION FROM MULTIPLE FEATURE SUBSETS

We  start  by  describing  the  MFS  algorithm  and  then 
we  discuss  the  motivation  behind  it  and  the  dangers  in 
using  it.  We  then  explain  how  we  set  the  algorithm’s 
parameters. 

2.1  THE  MFS  ALGORITHM 

The  algorithm  for  nearest  neighbor  classification  from 
multiple  feature  subsets  (MFS)  is  simple  and  can  be 
stated  as: 

Using simple voting, combine the outputs
from multiple NN classifiers, each
having access only to a random subset
of features.

We select the random subset of features by sampling
from the original set. We use two different sampling
functions: sampling with replacement, and sampling
without replacement. In sampling with replacement, a
feature can be selected more than once, which is equivalent
to increasing its weight.

Each of the NN classifiers uses the same number of
features. This is a parameter of the algorithm which
we set by cross-validation performance estimates on a
tuning dataset (see Section 2.2). Each time a pattern
is presented for classification, we select a new random
subset of features for each classifier.

Footnote 1: Recently Ricci and Aha (1998) have developed a
method for combining NN classifiers and ECOC which
solves the correlation problem. We discuss this in Section 5.

As an example of MFS classification, consider Fisher's
iris plant classification problem (Fisher, 1936; Duda
and Hart, 1973). In this domain, we try to classify
iris plants into their specific species: iris-setosa,
iris-virginica, and iris-versicolor, based on the following
four features: petal length, petal width, sepal length,
and sepal width. With MFS we might use three NN
classifiers, each using a random subset of features. The
first NN classifier might use {petal length, sepal width,
sepal length}, the second might use {petal width, petal
length, sepal width}, and the third might use {petal
width, sepal width, sepal width}, which we would treat
as {petal width, 2 × sepal width}.
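The procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes squared Euclidean distance over the selected features, breaks voting ties by first occurrence, and uses a made-up four-feature dataset; the names `nn_predict` and `mfs_predict` are hypothetical. A feature drawn twice contributes twice to the distance, which is the doubled weight mentioned above.

```python
import random
from collections import Counter


def nn_predict(train_X, train_y, query, features):
    """1-NN prediction using only the listed feature indices.
    A repeated index counts once per repetition in the distance."""
    def dist(row):
        return sum((row[f] - query[f]) ** 2 for f in features)
    best = min(range(len(train_X)), key=lambda i: dist(train_X[i]))
    return train_y[best]


def mfs_predict(train_X, train_y, query, n_classifiers, n_features,
                with_replacement=True, rng=random):
    """Simple vote over NN classifiers, each given a fresh random
    feature subset drawn for this query."""
    n_total = len(query)
    votes = []
    for _ in range(n_classifiers):
        if with_replacement:
            feats = [rng.randrange(n_total) for _ in range(n_features)]
        else:
            feats = rng.sample(range(n_total), n_features)
        votes.append(nn_predict(train_X, train_y, query, feats))
    return Counter(votes).most_common(1)[0][0]


# Made-up data: two well-separated classes, four features each.
TRAIN_X = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.2, 0.1, 0.0],
           [5.0, 5.0, 5.0, 5.0], [5.1, 4.9, 5.2, 5.0]]
TRAIN_Y = ["a", "a", "b", "b"]

print(mfs_predict(TRAIN_X, TRAIN_Y, [0.05, 0.1, 0.0, 0.1], 3, 3,
                  rng=random.Random(0)))
```

Passing a seeded `random.Random` instance keeps a run reproducible; sampling without replacement requires `n_features` to be no larger than the number of available features.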

The idea of using only a random subset of features
may seem counterintuitive, as we are throwing away
potentially valuable information. The accuracy of the
NN classifiers is likely to decrease compared to a classifier
that has access to all the features. Should we
not use all the information and make each classifier as
accurate as possible? Why should we create a set of
classifiers each less accurate than a single one trained
on all the information?

The answer to these questions lies in the dynamics
of simple voting among a set of classifiers. The individual
models do not need to be very accurate for
the system as a whole to achieve high accuracy, if the
models make different errors. In particular, Hansen
and Salamon (1990) showed that under simple voting,
if the models make independent errors, then the overall
error will decrease monotonically with increasing
numbers of classifiers. Ali and Pazzani (1996) verified
empirically that combining models with uncorrelated
errors could significantly reduce the overall error. Selecting
different features is an attempt to force the NN
classifiers to make different and uncorrelated errors.
We are trading off accuracy for error diversity.
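The effect of voting under independent errors can be checked numerically in a simplified setting. The sketch below assumes a two-class problem, an odd number of classifiers, and an identical individual error rate below 0.5 for every classifier; these simplifications and the name `majority_vote_error` are illustrative assumptions and are not drawn from the cited papers. Under them, the ensemble errs exactly when more than half of the classifiers err, which is a binomial tail probability.

```python
from math import comb


def majority_vote_error(p, n):
    """Probability that a majority of n independent classifiers,
    each wrong with probability p, is wrong (n assumed odd)."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Ensemble error for individually weak (30% error) classifiers.
errors = [majority_vote_error(0.3, n) for n in (1, 3, 5, 7, 9)]
for n, e in zip((1, 3, 5, 7, 9), errors):
    print(n, round(e, 4))
```

The printed error falls as classifiers are added, even though each classifier alone is wrong 30% of the time. The independence assumption is what drives the effect; correlated errors would weaken it.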

There  is  no  guarantee  that  using  different  feature  sets 
for  the  NN  classifiers  will  decorrelate  error.  However, 
Turner  and  Ghosh  (1996)  found  that  with  neural  net¬ 
works,  selectively  removing  features  could  decorrelate 
errors.  Unfortunately,  the  error  rates  in  the  individual 
classifiers  increased,  and  as  a  result  there  was  little  or 
no  improvement  in  the  ensemble.  Cherkauer  (1996) 
was  more  successful,  and  was  able  to  combine  neural 
networks  that  used  different  hand  selected  features  to 
achieve  human  expert  level  performance  in  identifying 
volcanoes from images.

One method of generating a diverse ensemble of classifiers
is to perturb some aspect of the training inputs
for  which  the  classifier  is  unstable.  For  example,  Bag¬ 
ging  (Breiman,  1996)  perturbs  the  training  patterns 
available  to  each  classifier  in  the  ensemble.  Since  deci¬ 
sion  trees  are  unstable  to  the  patterns,  Bagging  gener¬ 
ates  a  diverse  and  effective  ensemble.  Nearest  neigh¬ 
bor  classifiers  are  stable  to  the  patterns,  so  Bagging 
generates  poor  NN  ensembles.  Nearest  Neighbor  clas¬ 
sifiers,  however,  are  extremely  sensitive  to  the  features 
used.  For  example,  Langley  and  Iba  (1993)  found  that 
adding  just  a  few  irrelevant  features  could  drastically 
change  the  NN  classifier’s  outputs  (and  reduce  accu¬ 
racy).  MFS  attempts  to  use  this  instability  to  generate 
a  diverse  set  of  NN  classifiers  with  uncorrelated  errors. 

The discussion above motivates why we expect MFS to
improve the accuracy of the nearest neighbor classifier.
However, there are three major dangers that we should
be aware of when using MFS:

1.  Simple  voting  can  only  improve  accuracy  if  the 
classifiers  select  the  correct  class  more  often  than 
any  other  class.  Breiman  refers  to  this  as  order 
correctness.  If  the  classifiers  are  not  order  correct, 
then  simple  voting  will  increase  the  expected  er¬ 
ror.  For  two  class  problems,  we  require  slightly 
more  than  50%  accuracy  in  the  voting  classifiers 
to improve accuracy. With multiple classes, the
required accuracy may drop as low as 1/C, where C
is the number of classes.

2.  The  Bayes  error  rate  can  only  increase  by  using  a 
subset  of  features.  This  may  make  it  difficult  for 
the  NN  classifiers  used  by  MFS  to  meet  the  re¬ 
quirements  in  point  1.  For  example,  in  the  parity 
problem,  a  domain  with  highly  interacting  fea¬ 
tures,  the  Bayes  error  rate  in  any  proper  subset 
of  features  is  50%  (as  opposed  to  0%  for  the  full 
feature  space).  There  is  no  guarantee  that  ran¬ 
dom  subsets  will  have  the  necessary  information 
for  accurate  classification. 

3.  By  using  the  nearest  neighbor  classifier  in  the 
MFS  scheme  we  lose  its  asymptotic  optimality 
properties. Specifically, as the number of training
examples approaches infinity, the error rate of the NN
classifier is bounded by twice the Bayes error rate (Cover,
1967). The kNN classifier is Bayes optimal in the
limit  with  proper  choice  of  k  (Fix  and  Hodges, 
1951).  We  can  make  no  such  claims  about  MFS. 
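
The parity claim in point 2 can be checked by enumeration. The sketch below assumes that all bit patterns are equally likely, which the text does not state; the function name `parity_bayes_error` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def parity_bayes_error(n_bits, subset):
    """Bayes error for n-bit parity when only the features in `subset` are visible.

    Assumes all 2**n_bits inputs are equally likely.
    """
    counts = defaultdict(lambda: [0, 0])  # visible pattern -> [count of label 0, label 1]
    for bits in product((0, 1), repeat=n_bits):
        label = sum(bits) % 2
        visible = tuple(bits[i] for i in subset)
        counts[visible][label] += 1
    # The best rule predicts the majority label for each visible pattern,
    # so it errs on the minority count.
    return sum(min(c) for c in counts.values()) / 2 ** n_bits


n = 4
for size in range(n + 1):
    errors = {parity_bayes_error(n, s) for s in combinations(range(n), size)}
    print(size, sorted(errors))
```

Hiding even one bit leaves both labels equally likely for every visible pattern, so no classifier built on a proper subset can beat chance in this domain.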


2.2  PARAMETER  SELECTION 

The  MFS  algorithm  has  two  parameter  values  that 
need  to  be  set:  the  size  of  the  feature  subsets,  and  the 
number  of  classifiers  to  combine. 

We  set  MFS’s  subset  size  parameter  based  on  cross- 
validation  accuracy  estimates  on  the  training  set  for 
the entire ensemble. We evaluated ten evenly spaced
subset sizes over the size of the original feature set. For
example, if a domain had 34 features, then the subset
sizes 3, 7, 10, ..., 34 were evaluated. In the case of
ties, the smaller value was chosen.
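
One way to generate such a grid is sketched below. The rounding rule (round half up, with a minimum size of 1) is an assumption; it reproduces the sizes 3, 7, 10, ..., 34 quoted for 34 features, but the text does not say how the sizes were rounded. The function names are hypothetical.

```python
def candidate_subset_sizes(n_features, n_steps=10):
    """Evenly spaced subset sizes i * n_features / n_steps, rounded half up."""
    sizes = {
        max(1, (2 * i * n_features + n_steps) // (2 * n_steps))  # integer round-half-up
        for i in range(1, n_steps + 1)
    }
    return sorted(sizes)  # duplicates collapse when n_features < n_steps


def pick_subset_size(cv_accuracy):
    """cv_accuracy maps subset size -> cross-validated ensemble accuracy.

    Ties are broken in favour of the smaller subset size.
    """
    best = max(cv_accuracy.values())
    return min(size for size, acc in cv_accuracy.items() if acc == best)


print(candidate_subset_sizes(34))
```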

We  set  the  number  of  classifiers  by  evaluating  the  per¬ 
formance  of  MFS  on  seven  development  datasets  vary¬ 
ing  the  number  of  classifiers  from  10  to  1000.  Based  on 
the  results,  we  set  the  number  of  classifiers  to  100  as 
a  reasonable  trade-off  between  computational  expense 
and  accuracy. 

3  EXPERIMENTS 

3.1  METHODS 

We evaluated the performance of MFS using two different
sampling functions: sampling with replacement
(MFS1) and sampling without replacement (MFS2).
We compared these to four other algorithms: nearest
neighbor (NN), k nearest neighbor (kNN), and nearest
neighbor with forward (FSS) and backward (BSS) sequential
selection of features (Aha and Bankert, 1994).

The  use  of  FSS  and  BSS  should  provide  an  interesting 
contrast  with  MFS.  FSS  and  BSS  try  to  find  a  sin¬ 
gle  good  subset  of  features,  while  MFS  uses  multiple 
random  subsets  without  regard  to  their  performance. 

All classifiers used unweighted Euclidean distance for
continuous features and Hamming distance for symbolic
features. Missing values were treated as informative
and considered to be a specific symbolic value.
In the case of continuous features (normalized to [0,1]),
a missing value was considered to have a distance of 1
to all non-missing values. For the kNN classifier, the
value  of  k  was  set  using  cross-validation  performance 
estimates  on  the  training  set.  For  feature  selection, 
we  used  cross-validation  accuracy  on  the  training  set 
for  our  objective  function  (also  known  as  a  wrapper 
approach  (Kohavi  and  John,  1996)). 
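
A distance function along these lines might look as follows. The text does not say how the continuous and symbolic parts are combined; this sketch assumes each feature contributes a difference in [0, 1] (0 or 1 for symbolic features) and that the differences are combined as a Euclidean norm. The names `MISSING`, `feature_diff`, and `distance` are hypothetical.

```python
MISSING = None  # hypothetical marker for a missing value


def feature_diff(a, b):
    """Per-feature difference in [0, 1]."""
    if a is MISSING or b is MISSING:
        # Missing acts as its own symbolic value: it matches only itself.
        return 0.0 if (a is MISSING and b is MISSING) else 1.0
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0   # Hamming distance for symbolic features
    return abs(a - b)                   # continuous features, assumed scaled to [0, 1]


def distance(x, y):
    """Unweighted distance over mixed continuous and symbolic features."""
    return sum(feature_diff(a, b) ** 2 for a, b in zip(x, y)) ** 0.5


print(distance([0.2, "red", MISSING], [0.5, "blue", MISSING]))
```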

We  evaluated  the  algorithms  on  twenty-five  datasets 
from  the  UCI  Repository  of  Machine  Learning 
Databases (Merz and Murphy, 1998). We first normalized
the datasets so that continuous features ranged
over [0, 1], and then we ran thirty trials where the
training set contained 2/3 of the patterns (randomly
selected) and the test set contained the remaining 1/3.
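
The protocol can be sketched as below, assuming min-max scaling over the whole dataset before splitting (the order given in the text) and a fresh random split for each trial. The helper names and the way constant columns are handled are assumptions.

```python
import random


def min_max_normalize(rows):
    """Scale every continuous column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def train_test_split(n_patterns, rng):
    """Random split of pattern indices: 2/3 for training, the rest for testing."""
    indices = list(range(n_patterns))
    rng.shuffle(indices)
    cut = (2 * n_patterns) // 3
    return indices[:cut], indices[cut:]


rng = random.Random(0)
for trial in range(30):                      # thirty trials, one split each
    train_idx, test_idx = train_test_split(150, rng)
```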

There  were  a  few  exceptions  to  this  procedure.  For 
Waveform,  we  used  300  training  cases  and  4700  test 
cases  to  maintain  consistency  with  reported  results 
(Quinlan,  1996).  For  Satimage,  we  used  the  origi¬ 
nal  division  into  a  training  and  test  set,  so  the  results 
represent  one  run  of  each  algorithm.  For  the  Musk 
dataset,  which  has  166  features,  FSS  and  BSS  took 
too  long  to  run  (over  24  hours  for  a  single  trial)  and 
no  results  were  obtained. 

3.2  ACCURACY 

The  accuracy  and  parameter  selection  results  (average 
k  or  number  of  features  selected)  are  shown  in  Table  1. 
The  first  seven  datasets  were  used  in  the  development 
of  the  MFS  algorithm.  The  default  accuracy  is  the 
frequency  of  the  most  common  class. 

The results show that MFS is promising: averaged over
all domains, MFS1 and MFS2 were about 2 percentage
points more accurate than their nearest competitor, kNN.
MFS1 was best on 16 domains out of 25 (not including
MFS2). MFS2 was best on 14 domains and tied in 3 (not
including MFS1). For a formal comparison, we used the
Wilcoxon signed rank test and found that MFS1 and MFS2
were significantly better than all others with a confidence
level greater than 99%.
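
For readers unfamiliar with the test, the sketch below computes the Wilcoxon signed rank statistic for paired per-dataset accuracies. It is an illustration only: it uses a normal approximation without a tie correction, which may differ from the procedure behind the reported result, and the accuracies in the example are hypothetical.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Return (W+, W-, two-sided p) for paired samples, normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based; tied values share the mean rank
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    return w_plus, w_minus, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical accuracies (in percent) of two classifiers on six datasets.
print(wilcoxon_signed_rank([76, 83, 94, 65, 87, 92], [67, 80, 86, 61, 85, 95]))
```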

MFS only performed poorly on two datasets: Iris and
Tic-Tac-Toe. For Iris, both MFS1 and MFS2 gave the
lowest accuracy out of all the classifiers. This can possibly
be explained by the small number of features in
the Iris dataset. With only four features, many of the
feature subsets would be identical. This would lead
to identical errors and high error correlation. For
Tic-Tac-Toe, MFS1 performed extremely poorly, having
an error rate almost five times that of the NN and
kNN classifiers. MFS1 probably performed poorly because
in the Tic-Tac-Toe domain the features have a
high amount of interaction. We need to examine all
the features to determine which side has won. Taking
a random subset of features does not make sense and
would probably lead to a greatly increased Bayes error
rate for the individual classifiers. MFS2 did not experience
the same degradation as MFS1 because sampling
without replacement degenerated into selecting all the
features and hence performing identically to NN.

Comparing MFS1 to MFS2, it is not clear which classifier
performed better. MFS1 was better than MFS2
on 15 domains, worse on 7, and tied in 3. However,
MFS2 had a slightly better average accuracy as it did
not have a catastrophic failure on Tic-Tac-Toe. The
Wilcoxon test did not detect a significant difference
between them.

3.3  COMPUTATIONAL  COMPLEXITY 

The  nearest  neighbor  classifier  is  often  criticized  for 
slow  runtime  performance,  so  we  will  briefly  comment 
on  the  complexity  of  MFS  and  then  present  actual 
running  times  from  the  experiments. 

The NN classifier computes the distance between the
test pattern and every pattern in the training set. This
requires $O(ef)$ time, where $e$ is the number of examples
and $f$ is the number of features. For MFS,
we use $n$ NN classifiers, so its complexity is $O(nef)$.
For training, we use cross-validation and MFS requires
$O(ne^{2}fv)$ time, where $v$ is the number of folds (Bay,
1997).

This  analysis  shows  how  the  computational  require¬ 
ments  of  MFS  change  as  a  function  of  the  number  of 
examples  and  features.  However,  it  does  not  give  any 
indication  of  actual  running  times  on  real  datasets. 
Therefore  in  Table  2  we  list  the  actual  running  times 
on  an  Intel  Pentium  Pro  processor  for  NN  and  MFS 
on  the  three  slowest  datasets. 


Table 2: Time Requirements for NN and MFS1

              Classification            Training
Domain        NN           MFS1         MFS1
Satimage      0.080s/pat   0.415s/pat   4.6h
Segment       0.015s/pat   0.075s/pat   19.9m
Annealing     0.018s/pat   0.073s/pat   5.5m

Note that even though we are combining 100 classifiers
in MFS, it was only about five times as slow as the NN
classifier. We attribute this speedup to caching the
difference in feature values between the test pattern
and all patterns in the training set (i.e., in
$d(x, y) = \sqrt{\sum_{f} (x_f - y_f)^2}$ we cache $(x_f - y_f)^2$).
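
The caching idea can be sketched as follows, assuming numeric features. The per-feature squared differences are computed once per test pattern, and each classifier then only sums the cached entries for its own subset. The function name and data layout are hypothetical.

```python
def squared_distances_cached(train_X, x, subsets):
    """Squared distances from x to every training pattern, one list per feature subset."""
    # sq[i][f] = (train_X[i][f] - x[f]) ** 2, computed once and shared by all classifiers.
    sq = [[(row[f] - x[f]) ** 2 for f in range(len(x))] for row in train_X]
    # Each classifier only adds up the cached terms for its own features.
    return [[sum(row_sq[f] for f in subset) for row_sq in sq] for subset in subsets]


train_X = [[0.0, 1.0, 2.0], [1.0, 1.0, 0.0]]
x = [0.5, 0.0, 1.0]
subsets = [[0, 1], [2, 2], [0, 1, 2]]   # a repeated index counts twice
print(squared_distances_cached(train_X, x, subsets))
```

Ranking neighbors by squared distance gives the same nearest neighbor as ranking by distance, so the square root can be skipped.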

3.4  ROBUSTNESS  TO  IRRELEVANT 
FEATURES 

A  major  drawback  of  the  NN  classifier  is  its  sensitivity 
to  irrelevant  features.  This  concerns  us  because  the 
MFS  algorithm  uses  multiple  NN  classifiers  and  hence 
raises  the  question:  how  will  the  ensemble  behave?  If 
the  accuracy  of  the  individual  NN  classifiers  drops  too 
low, simple voting can increase the error rate. Since
we are unsure of how the ensemble will behave, we
experimentally investigated the robustness of MFS to
irrelevant features.

Table 1: Accuracy and Parameter Selection Results (average k or number of features selected)

                           Accuracy                                           Average Parameter Settings
Domain           Pat/F     Def.   NN     kNN    FSS    BSS    MFS1   MFS2     kNN    FSS    BSS    MFS1   MFS2
Glass            214/9     35.5   67.9   66.8   72.3   72.5   75.8   76.1     1.7    4.8    5.5    -      -
Hepatitis        155/19    79.4   79.2   80.4   80.3   77.2   82.7   82.6     6.7    2.4    12.8   -      -
Ionosphere       351/34    64.1   86.5   85.5   88.2   87.9   93.5   92.7     1.8    4.6    21.9   6.9    6.5
Iris             -         33.3   94.3   95.1   93.7   93.5   92.5   92.7     6.1    1.4    2.3    2.8    -
Liver-Disorders  345/7     58.0   60.4   61.3   56.8   60.0   65.4   64.4     9.7    1.9    4.2    -      -
Pima Diabetes    768/8     65.1   69.7   73.6   67.7   68.5   72.5   72.3     11.5   2.0    6.5    -      -
Sonar            -         53.4   85.0   85.1   76.0   84.3   87.3   87.0     1.1    6.3    38.2   -      -
Annealing        898/38    76.2   98.0   98.8   98.8   98.6   -      -        1.0    8.2    9.0    31.6   21.3
Automobile       205/25    32.7   70.9   70.9   74.2   72.8   72.5   73.3     1.0    3.3    10.3   8.7    6.3
Breast Cancer    -         70.3   65.9   74.3   71.0   70.0   74.0   74.0     8.0    1.9    5.0    6.7    4.6
Credit           690/15    55.5   81.6   85.5   85.7   81.6   86.3   85.8     12.4   3.2    10.5   8.8    6.3
German           1000/20   70.0   70.5   73.1   70.6   68.8   74.4   74.2     10.8   3.0    15.7   15.4   11.2
Horse Colic      368/22    63.0   76.8   79.8   83.9   76.5   80.2   79.8     15.1   2.4    14.8   9.8    7.8
Labor            57/16     64.9   92.1   90.4   78.6   89.5   94.2   94.6     2.3    2.8    7.5    6.7    5.1
Lymphography     148/18    54.7   74.6   77.0   74.8   76.7   81.9   80.4     8.7    3.7    12.1   11.6   8.3
Musk             476/166   56.5   84.3   83.9   na     na     88.9   88.6     1.4    na     na     18.1   19.1
Primary-Tumor    339/17    24.5   37.0   43.5   37.8   38.9   44.5   45.0     13.8   6.3    11.2   10.6   8.1
Satimage         6435/36   22.8   89.5   90.4   88.0   89.4   91.5   91.0     3      10     33     14     11
Segment          2310/19   14.3   93.5   93.0   96.5   96.6   96.8   96.6     4.6    4.8    9.9    10.3   7.9
Soybean-Large    683/35    13.0   90.7   90.5   93.2   90.7   93.4   93.2     1.5    11.9   20.2   21.9   14.9
Tic-Tac-Toe      958/9     65.3   98.1   98.1   87.8   98.1   91.1   98.1     1.0    6.6    9.0    9.0    9.0
Vehicle          946/18    25.8   68.1   67.7   66.6   70.4   71.4   71.4     5.7    5.4    12.5   9.7    6.8
Vote             435/16    54.8   92.9   93.1   95.8   94.6   94.9   94.5     4.3    2.8    9.2    11.8   8.4
Waveform         5000/21   33.9   74.9   81.4   70.3   74.4   81.0   80.9     13.7   7.4    16.8   10.0   8.1
Wine             178/13    39.9   95.2   96.7   92.8   94.8   97.6   97.9     9.8    4.1    7.8    3.8    3.5
average                    49.1   79.9   81.4   79.2   80.3   83.3   83.4     6.3    4.6    12.8   8.3    -

A dash (-) marks an entry missing from the source. Where accuracy or
parameter entries are missing, the surviving values are listed in source
order and their column assignment is uncertain.

We used the same basic procedure as in Section 3.1. We
added 10, 20, and 30 boolean irrelevant features to
each of the datasets and then measured the accuracy of
kNN and MFS1. We chose boolean irrelevant features
because they are more difficult for nearest neighbor
methods to handle than continuous irrelevant features.
This is because while they both have the same range
and mean, boolean variables have greater variance.
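
The variance claim can be made concrete under an assumption the text does not state: suppose a continuous irrelevant feature $U$ is uniform on $[0, 1]$ and a boolean irrelevant feature $B$ takes the values 0 and 1 with equal probability. Both have range $[0, 1]$ and mean $1/2$, and with $U'$, $B'$ denoting independent copies:

```latex
\begin{align*}
\operatorname{Var}(U) &= \int_0^1 \left(u - \tfrac12\right)^2 du = \tfrac{1}{12}, &
\operatorname{Var}(B) &= \tfrac12\left(0 - \tfrac12\right)^2 + \tfrac12\left(1 - \tfrac12\right)^2 = \tfrac14, \\
\mathbb{E}\left[(U - U')^2\right] &= 2\operatorname{Var}(U) = \tfrac16, &
\mathbb{E}\left[(B - B')^2\right] &= 2\operatorname{Var}(B) = \tfrac12 .
\end{align*}
```

Under these assumptions each boolean irrelevant feature adds, in expectation, three times as much noise to a squared distance as a uniform continuous one.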

Table  3  shows  the  results  for  several  domains.  The 
remaining  results  (Bay,  1997)  are  not  shown  here  for 
space  reasons,  but  they  follow  a  similar  pattern. 

As  expected,  irrelevant  features  always  hurt  both  kNN 
and  MFS  to  some  degree.  However,  the  results  are 
surprising  because  they  reveal  that  on  some  domains 
kNN  is  critically  sensitive  while  MFS  is  stable.  For  ex¬ 
ample,  on  Vehicle  and  Wine  with  10  added  irrelevant 
features,  kNN  drops  in  accuracy  by  over  20%  while 
MFS  drops  by  less  than  2%.  In  general,  MFS  had  only 
minor  degradations  in  accuracy  and  was  occasionally 
very robust. For example, MFS’s accuracy on Ionosphere
degrades by so little (from 93.5% to 90.1%) that it
is still better on the dataset corrupted by 30 irrelevant
features than all of the other classifiers on the original
dataset.

One  possible  explanation  for  MFS’s  performance  lies 
in  how  random  voters  affect  the  margins  of  victory 
in  simple  voting.  For  simplicity,  let  us  divide  all  vot¬ 
ers  into  two  types:  informed  (using  relevant  features) 
and  uninformed  (random)  voters.  The  informed  vot¬ 
ers  cast  their  ballots,  and  the  winner  will  have  a  given 
margin  of  votes  compared  to  the  next  closest  competi¬ 
tor.  The  uninformed,  random  voters  then  cast  their 
ballots. The random voters vote with equal probability
and equal expectation for all competitors (according
to a multinomial distribution). In order for
random voting to change the outcome, the number
of random votes for class X must meet the following
inequality: randvotes(X) - randvotes(trueclass) >
margin(trueclass, X). Unless the margins from the informed
voters are small, this is unlikely to occur since
E(randvotes(X)) = E(randvotes(trueclass)).

As a numerical example, consider a two-class problem
with fifty informed voters and fifty random voters. The
fifty informed voters cast their ballots and the outcome
is 30 votes for class A and 20 votes for class B. The
fifty uninformed voters then cast their ballots. In order
for the uninformed voters to change the outcome of
the vote (class A wins) at least 30 must vote for class
B. The probability that the decision will change is
approximately 8%.
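
The figure can be checked with a short binomial calculation. The sketch assumes each random voter picks A or B with probability 1/2. With exactly 30 random votes for B the overall vote is tied at 50 to 50; treating a tie as a coin toss is an assumption made here (the text does not state a tie-breaking rule), and it is one reading that gives roughly 8%.

```python
from math import comb


def prob_at_least(n_random, k):
    """P(at least k of n_random fair coin-flip voters choose class B)."""
    return sum(comb(n_random, j) for j in range(k, n_random + 1)) / 2 ** n_random


p_b_wins = prob_at_least(50, 31)            # B strictly overtakes A (about 0.059)
p_tie = comb(50, 30) / 2 ** 50              # exactly 30 votes for B: 50-50 tie (about 0.042)
p_change = p_b_wins + 0.5 * p_tie           # tie broken by a coin toss (about 0.080)
print(p_b_wins, p_tie, p_change)
```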

This  situation  is  analogous  to  what  occurs  when  MFS 
is  applied  to  domains  with  irrelevant  features.  The  NN 
classifiers  are  the  voters,  and  can  become  uninformed 
and  random  when  both  of  the  following  conditions  are 
met:  (1)  the  randomly  selected  features  are  irrelevant, 
and (2) the classes occur with roughly equal frequency
in the training set (this is true in many of the UCI
datasets).  Note  that  if  only  the  first  condition  is  met, 
the  NN  classifier  will  be  random  but  will  choose  classes 
roughly  in  proportion  to  their  frequencies  in  the  train¬ 
ing  set. 

4  BIAS-VARIANCE  ANALYSIS  OF
ERROR 

The  expected  error  of  an  algorithm  can  be  divided  into 
two  components:  bias  which  is  the  consistent  error 
that  the  algorithm  makes  over  many  different  runs, 
and  variance  which  is  error  that  fluctuates  from  run 
to  run.  This  decomposition  is  a  useful  method  for  ex¬ 
plaining  how  changes  to  an  algorithm  affect  the  final 
error  rates.  It  allows  us  to  decompose  the  error  into 
meaningful  components  and  to  see  how  the  error  com¬ 
ponents  change  with  variations  in  the  algorithm. 

Several researchers have used the bias-variance analysis
of error to show how multiple model approaches
work.  For  example,  both  Breiman  (1996b)  and 
Schapire  et  al.  (1997)  showed  that  Bagging  improves 
performance  by  reducing  the  variance  component  of 
error.  Kong  and  Dietterich  (1996)  showed  that  ECOC 
could  reduce  both  bias  and  variance. 

The bias-variance decomposition of error originated
in squared error for regression. For classification, 0-1
loss (misclassification rate) is commonly used, but this
does not have a straightforward or unique decomposition.
Recently, many authors have proposed similar
decompositions (Kong and Dietterich, 1996; Breiman,
1996b; James and Hastie, 1997; Tibshirani, 1996; Kohavi
and Wolpert, 1996).

We used Kong and Dietterich’s (1996) definitions.
They define bias to be “the error of the ideal voted hypothesis,”
which is the result we would get from combining
an infinite number of classifiers, each trained
on an independent set of examples. Variance is the
“difference between the expected error rate and the
ideal voted hypothesis error rate.” Formally, where $A$
is the algorithm, $m$ is the training set size, $x$ is the
unknown test point, $f(x)$ is the class of $x$, $f^*(x)$ is the
ideal voted hypothesis of the algorithm $A$ at $x$, and
$\mathrm{Error}(A, m, x)$ is the expected error of algorithm $A$ at
$x$ using training sets of size $m$, then bias and variance
are:

$$\mathrm{Bias}(A, m, x) = \begin{cases} 1 & \text{if } f^*(x) \neq f(x) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$\mathrm{Variance}(A, m, x) = \mathrm{Error}(A, m, x) - \mathrm{Bias}(A, m, x) \tag{2}$$
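
These definitions translate directly into an estimation procedure. The sketch below assumes we already have the predictions of classifiers trained on several independent training sets; the ideal voted hypothesis is approximated by a plurality vote over those runs. The function name, the data layout, and the arbitrary handling of vote ties are assumptions.

```python
from collections import Counter


def bias_variance(predictions, truth):
    """Estimate (bias, variance, error), averaged over the test points.

    predictions[r][i] is the class predicted for test point i by the
    classifier trained on the r-th independent training set.
    """
    n_runs = len(predictions)
    bias_total = error_total = 0.0
    for i, y in enumerate(truth):
        votes = Counter(run[i] for run in predictions)
        ideal = votes.most_common(1)[0][0]            # approximates the ideal voted hypothesis
        bias_total += 1.0 if ideal != y else 0.0      # 0-1 error of the voted hypothesis
        error_total += sum(1 for run in predictions if run[i] != y) / n_runs
    bias = bias_total / len(truth)
    error = error_total / len(truth)
    return bias, error - bias, error                  # variance = error - bias


# One test point of class "b": three of four runs predict "a".
print(bias_variance([["a"], ["a"], ["a"], ["b"]], ["b"]))
```

In this single-point example the voted hypothesis is wrong, so the bias is 1, while the one correct run makes the expected error 0.75 and the variance negative.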

Note  that  the  Bayes  error  is  incorporated  into  the  bias 
error.  Also,  the  variance  can  be  negative.  This  oc¬ 
curs  when  the  algorithm  is  usually  wrong,  but  makes 
a  lucky  guess  and  predicts  the  correct  class. 

We investigated the bias-variance components of error
on three datasets originally used by Breiman (1996b)
and later by Schapire et al. (1997) to evaluate multiple
model approaches. The datasets are two-class
problems, with the individual classes composed of
20-dimensional Gaussians.

We compared four classifiers: NN, kNN, MFS1 with
1 classifier (1-MFS1), and MFS1 with 100 classifiers.
The NN classifier is the control, to which we can compare
the kNN and MFS algorithms. 1-MFS1 should
allow us to separate the changes to the error components
that are caused by random feature selection from
the changes that are caused by voting among multiple
classifiers.

We used a test set of 3000 instances and 100 independent
training sets of size 300 to estimate the bias,
variance, and error of the four classifiers. We approximated
f*(x) by voting over the classifiers trained on
the 100 independent training sets. The results are
shown in Table 4.

In Twonorm and Threenorm, selecting a single random
subset of features (1-MFS1) destabilizes the NN
classifier and causes the variance error to increase
significantly. During voting (MFS1) the variance error is
reduced to a much smaller value than the variance of
the original NN classifier, thus reducing the overall error
significantly.

For Ringnorm, the feature selection process does a dramatic
trade of bias for variance. The bias error drops
from 47.1% to only 4.6%, while the variance increases
from -7.9% to 25.8%. Voting then drops the variance
to only 2%, greatly improving accuracy.

Table 3: Accuracy of kNN and MFS Under Corruption by Irrelevant Features

                 kNN                            MFS1
Domain           0      10     20     30        0      10     20     30
Breast Cancer    74.3   71.0   70.3   69.8      74.0   71.5   71.3   70.5
German           73.1   72.0   70.9   70.5      74.4   72.6   71.3   70.7
Ionosphere       85.5   73.7   71.7   69.5      93.5   91.3   91.4   90.1
Soybean-Large    90.5   80.6   75.2   71.1      93.4   87.7   81.2   76.9
Vehicle          67.7   37.8   35.5   34.1      71.4   69.7   66.0   64.2
Vote             93.1   91.8   91.1   90.9      94.9   93.0   92.0   91.3
Wine             96.7   72.5   62.2   61.2      97.6   96.9   93.7   91.8

Table 4: Bias Variance Decomposition of Error

Domain                 Opt.    NN      1-MFS1  MFS1    kNN
Twonorm     bias       2.3     2.4     2.6     2.4     2.4
            variance   -       4.9     17.8    1.3     1.0
            error      2.3     7.3     20.4    3.7     3.4
Threenorm   bias       10.5    10.5    11.6    10.4²   11.2
            variance   -       13.6    22.5    6.3     4.4
            error      10.5    24.1    34.1    16.8    15.6
Ringnorm    bias       1.3     47.1    4.6     3.7     47.1
            variance   -       -7.9    25.8    2.0     -7.9
            error      1.3     39.2    30.4    5.7     39.2

From  these  datasets,  we  see  that  MFS  has  two  modes 
of  operation:  (1)  decreasing  variance  through  voting, 
and  (2)  trading  bias  for  variance  through  random  fea¬ 
ture  selection.  Taken  together,  MFS  is  able  to  reduce 
both  bias  and  variance  components  of  error. 

In comparison to MFS, the kNN classifier reduced only
variance. On Twonorm and Threenorm the error of
NN was dominated by variance (the bias error was
nearly optimal) and, like MFS, kNN was able to decrease
error by reducing the variance. In fact, kNN did a
better job than MFS at variance reduction. On Ringnorm,
the error of the NN classifier was dominated by
bias and kNN was not able to improve performance.


² The value for bias should always be greater than or
equal to the Bayes error rate (10.5%); however, because
of estimation error from finite sample sizes, it is possible
to obtain bias estimates that are lower than the optimal
bound.


5  RELATED  WORK 

Although  there  is  a  large  body  of  research  on  multi¬ 
ple  model  methods  for  classification,  very  little  specif¬ 
ically  deals  with  combining  NN  classifiers.  We  are 
only  aware  of  Skalak’s  (1996)  work  on  combining  NN 
classifiers  with  small  prototype  sets,  Alpaydin’s  (1997) 
work  with  condensed  nearest  neighbor  (CNN)  classi¬ 
fiers  (Hart,  1968),  and  Ricci  and  Aha’s  (1998)  work  on 
combining  NN,  feature  selection,  and  ECOC. 

Skalak  and  Alpaydin  approach  the  problem  of  combin¬ 
ing  NN  classifiers  similarly.  They  drastically  reduce 
the  size  of  each  classifier’s  prototype  set  to  destabilize 
the  NN  classifier.  Skalak  investigates  several  differ¬ 
ent  strategies  for  finding  a  reduced  prototype  set  and 
even  pursues  an  approach  called  “radical  destabiliza¬ 
tion”  where  the  NN  classifier  has  just  a  single  proto¬ 
type  per  class.  He  was  able  to  improve  accuracy  over 
the  baseline  NN  classifier  in  10  of  13  UCI  domains. 
Interestingly,  MFS  did  well  on  Glass  and  Lymphog¬ 
raphy  (average  increase  of  over  7%  compared  to  the 
NN  classifier);  these  are  two  domains  where  Skalak  re¬ 
ported  that  no  combining  algorithm  improved  perfor¬ 
mance.  Alpaydin  uses  dataset  partitioning  (bootstrap 
or  disjoint)  in  combination  with  the  CNN  classifier  to 
edit  and  reduce  the  prototypes.  He  also  reported  im¬ 
provements  over  the  NN  classifier  if  the  training  sets 
were  sufficiently  small  and  thus  able  to  generate  di¬ 
verse  classifiers. 

Ricci  and  Aha  (1998)  applied  ECOC  to  the  NN  clas¬ 
sifier  (NN-ECOC).  Normally,  applying  ECOC  to  NN 
would  not  work  as  the  errors  in  the  two-class  problems 
would  be  highly  correlated;  however,  they  found  that 
applying  feature  selection  to  the  two-class  problems 
decorrelated  errors  if  different  features  were  selected. 
With  this  method  they  were  able  to  improve  perfor¬ 
mance  in  many  of  the  domains  tested,  and  they  noted 
that ECOC accuracy gains tended to increase with increased
diversity among the features selected for the
two-class problems.

NN-ECOC  is  similar  to  MFS  as  they  both  use  NN 
classifiers  with  different  features.  They  differ  in  that 
NN-ECOC  uses  active  selection  of  features  (and  out¬ 
put  coding)  while  MFS  uses  random  selection.  A  head 
to  head  comparison  would  be  useful  to  determine  if 
NN-ECOC  and  MFS  achieve  their  accuracy  gains  in 
the  same  areas  of  the  feature  space.  Ricci  and  Aha 
also  analyzed  NN-ECOC  for  bias  and  variance  and 
concluded  that  NN-ECOC  reduces  bias  but  slightly 
increases variance. Unfortunately, because we used
a different definition of bias and variance, our results
are not directly comparable.

Regardless  of  which  method  has  better  accuracy,  MFS 
appears  to  have  two  main  advantages  over  NN-ECOC: 
(1)  MFS  is  the  simpler  algorithm,  and  (2)  MFS  is  not 
constrained  by  ECOC  to  multiclass  problems. 

6  CONCLUSIONS  AND  FUTURE 
WORK 

We  introduced  MFS,  a  new  algorithm  for  combining 
multiple  NN  classifiers.  In  MFS,  each  NN  classifier  has 
access  to  all  the  patterns  in  the  original  training  set 
but  only  to  a  random  subset  of  the  features. 

Our  experiments  showed  that  MFS  was  effective  in 
improving  accuracy.  But  beyond  accuracy  improve¬ 
ments,  MFS  is  a  significant  advance  because  it  allows 
us  to  incorporate  many  desirable  properties  of  the  NN 
classifier  in  a  multiple  model  framework.  For  example, 
one  of  the  primary  advantages  of  the  NN  classifier  is 
its  ability  to  incrementally  add  new  data  (or  remove 
old  data)  without  requiring  retraining.  MFS  maintains 
this  property  and  new  data  can  be  added  (old  data  re¬ 
moved)  at  runtime.  Another  useful  property  of  the 
NN  classifier  is  its  ability  to  predict  directly  from  the 
training  data  without  using  intermediate  structures. 
As  a  result,  no  matter  how  many  classifiers  we  com¬ 
bine  in  MFS,  we  require  only  the  same  memory  as  a 
single  NN  classifier.  (The  combined  NN  classifiers  can 
share  a  common  dataset,  and  the  features  are  selected 
randomly  at  runtime.) 

MFS has disadvantages and it should not be used
indiscriminately. In particular, MFS loses the asymptotic
optimality properties of the NN and kNN classifiers.
Additionally, on domains with highly interacting
features, such as Tic-Tac-Toe, the error rate can increase
too much in the feature subsets, resulting in poor
ensemble performance. As with all multiple model
approaches, we lose comprehensibility compared to a single
model. The user must judge whether the potential
accuracy increase is worth these disadvantages.

MFS  is  our  first  attempt  at  using  random  feature  selec¬ 
tion  to  generate  effective  NN  ensembles,  and  although 
successful  at  improving  accuracy,  there  are  still  many 
unanswered  questions  and  open  areas  for  future  work: 

1.  Why  does  MFS  work?  We  made  an  initial  at¬ 
tempt  at  answering  this  question  with  our  anal¬ 
ysis  of  irrelevant  features  and  the  bias-variance 
decomposition  of  error.  But  clearly  more  work 
needs  to  be  done  as  we  cannot  even  characterize 
the  domains  MFS  will  do  well  on. 

2.  Application  to  other  classifiers.  We  showed  that 
random  feature  selection  is  useful  for  generating 
ensembles  of  NN  classifiers.  Can  we  apply  this 
technique  to  other  learning  algorithms? 

3.  Implications  for  feature  selection  and  feature 
weighting.  The  experimental  results  showed  that 
combining  multiple  random  feature  subsets  can 
significantly  improve  performance  over  the  single 
best  subset  of  features  found  by  FSS  or  BSS.  This 
implies  that  instead  of  searching  for  the  single  best 
set  of  features,  we  should  be  searching  for  multiple 
feature  sets  that  work  well  together. 

4.  Other  Improvements.  In  this  paper,  we  kept  the 
design  of  MFS  as  simple  as  possible;  however, 
there  are  a  number  of  obvious  improvements  that 
may  help  accuracy  and  speed.  In  particular,  we 
would  like  to  investigate:  (1)  different  weighting 
schemes,  (2)  varying  the  number  of  features  each 
classifier  uses,  (3)  postpruning  the  ensemble,  (4) 
combining  more  sophisticated  versions  of  the  NN 
classifier, and (5) editing the prototypes.

Abstract

Predicting  items  a  user  would  like  on  the  basis  of 
other  users’  ratings  for  these  items  has  become  a 
well-established  strategy  adopted  by  many  rec¬ 
ommendation  services  on  the  Internet.  Although 
this  can  be  seen  as  a  classification  problem,  algo¬ 
rithms  proposed  thus  far  do  not  draw  on  results 
from  the  machine  learning  literature.  We  propose 
a  representation  for  collaborative  filtering  tasks 
that  allows  the  application  of  virtually  any  ma¬ 
chine  learning  algorithm.  We  identify  the  short¬ 
comings  of  current  collaborative  filtering  tech¬ 
niques  and  propose  the  use  of  learning  algo¬ 
rithms  paired  with  feature  extraction  techniques 
that  specifically  address  the  limitations  of  previ¬ 
ous  approaches.  Our  best-performing  algorithm 
is  based  on  the  singular  value  decomposition  of 
an  initial  matrix  of  user  ratings,  exploiting  latent 
structure  that  essentially  eliminates  the  need  for 
users  to  rate  common  items  in  order  to  become 
predictors for one another's preferences. We
evaluate  the  proposed  algorithm  on  a  large  data¬ 
base  of  user  ratings  for  motion  pictures  and  find 
that  our  approach  significantly  outperforms  cur¬ 
rent  collaborative  filtering  algorithms. 


1  INTRODUCTION 

Research  on  intelligent  information  agents  in  general,  and 
recommendation  systems  in  particular,  has  recently  at¬ 
tracted  much  attention.  The  reasons  for  this  are  twofold. 
First,  the  amount  of  information  available  to  individuals  is 
growing  steadily.  Information  overload  has  become  a 
popular  buzzword  of  our  times  and  people  feel  over¬ 
whelmed  when  navigating  through  today's  information 
and  media  landscape.  This  leads  to  a  clear  demand  for 
automated  methods,  commonly  referred  to  as  intelligent 
information agents, that locate and retrieve information
with respect to users’ individual preferences. Second, the
number  of  users  accessing  the  Internet  is  also  growing. 
Not  only  does  this  lead  to  an  incredible  variety  of  subjects 
that  can  be  learned  about  online,  it  opens  up  new  possi¬ 
bilities  to  organize  and  recommend  information.  The  cen¬ 
tral  idea  here  is  to  base  personalized  recommendations  for 
users  on  information  obtained  from  other,  ideally  like- 
minded,  users.  This  is  commonly  known  as  collaborative 
filtering  or  social  filtering. 

The underlying techniques used in today's recommendation
systems fall into two distinct categories: content-based
and collaborative methods. Content-based methods
require  textual  descriptions  of  the  items  to  be  recom¬ 
mended  and  draw  on  results  from  both  information  re¬ 
trieval  and  machine  learning  research  (e.g.,  Pazzani  and 
Billsus,  1997).  In  general,  a  content-based  system  ana¬ 
lyzes  a  set  of  documents  rated  by  an  individual  user  and 
uses  the  content  of  these  documents,  as  well  as  the  pro¬ 
vided  ratings,  to  infer  a  profile  that  can  be  used  to  rec¬ 
ommend  additional  items  of  interest.  In  contrast,  collabo¬ 
rative  methods  recommend  items  based  on  aggregated 
user  ratings  of  those  items,  i.e.  these  techniques  do  not 
depend  on  the  availability  of  textual  descriptions.  Both 
approaches  share  the  common  goal  of  assisting  in  the 
user’s  search  for  items  of  interest,  and  thus  attempt  to 
address  one  of  the  key  research  problems  of  the  informa¬ 
tion  age:  locating  needles  in  a  haystack  that  is  growing 
exponentially. 

In  this  paper  we  focus  on  collaborative  filtering  tech¬ 
niques.  A  variety  of  algorithms  have  previously  been  re¬ 
ported  in  the  literature  and  their  promising  performance 
has  been  evaluated  empirically  (Shardanand  and  Maes, 
1995;  Hill  et  al.  1995;  Resnick  et  al.  1994).  These  results, 
and  the  continuous  increase  of  people  connected  to  the 
Internet,  led  to  the  development  and  employment  of  nu¬ 
merous  collaborative  filtering  systems.  Virtually  all  topics 
that  could  be  of  potential  interest  to  users  are  covered  by 
special-purpose  recommendation  systems:  web  pages, 
news stories, movies, music videos, books, CDs, restaurants,
and many more. Some of the best-known representatives
of these systems, such as FireFly
(www.firefly.com)  or  WiseWire  (www.wisewire.com) 
have  turned  into  commercial  enterprises.  Furthermore, 
collaborative  filtering  techniques  are  becoming  increas¬ 
ingly  popular  as  part  of  online  shopping  sites.  These  sites 
incorporate  recommendation  systems  that  suggest  prod¬ 
ucts  to  users  based  on  products  that  like-minded  users 
have  ordered  before,  or  indicated  as  interesting.  For  ex¬ 
ample,  users  can  find  out  which  CD  they  should  order 
from  an  online  CD  store  if  they  provide  information  about 
their  favorite  artists,  and  several  online  bookstores  (e.g. 
amazon.com)  can  associate  their  available  titles  with  other 
titles  that  were  ordered  by  like-minded  people. 

Although  there  seems  to  be  an  increasingly  strong  de¬ 
mand  for  collaborative  filtering  techniques,  only  a  few 
different  algorithms  have  been  proposed  in  the  literature 
thus  far.  Furthermore,  the  reported  algorithms  are  based 
on  rather  simple  predictive  techniques.  Although  collabo¬ 
rative  filtering  can  be  seen  as  a  classification  task,  the 
problem  has  not  received  much  attention  in  the  machine 
learning  community.  It  seems  likely  that  predictive  per¬ 
formance  can  be  increased  through  the  development  of 
special-purpose  algorithms  that  draw  on  results  from  the 
machine  learning  literature. 

This  paper  can  be  outlined  as  follows.  We  briefly  present 
the  central  ideas  of  previously  reported  collaborative  fil¬ 
tering  algorithms.  We  identify  the  main  shortcomings  of 
these  approaches  and  motivate  the  need  for  techniques 
that  do  not  suffer  from  these  limitations.  We  then  explain 
how  the  task  of  computing  collaborative  recommenda¬ 
tions  can  be  represented  as  a  classification  task.  Within 
this  framework  we  present  a  learning  algorithm  that  ad¬ 
dresses  the  limitations  of  previous  approaches.  The  pro¬ 
posed  method  is  based  on  dimensionality  reduction 
through  the  singular  value  decomposition  (SVD)  of  an 
initial  matrix  of  user  ratings,  exploiting  latent  structure 
that  essentially  eliminates  the  need  for  users  to  rate  com¬ 
mon  items  in  order  to  become  predictors  for  one  another’s 
preferences.  An  artificial  neural  network  is  used  to  com¬ 
pute  final  recommendations.  We  evaluate  our  algorithm 
on  a  large  database  of  user  ratings  for  motion  pictures  and 
show  that  it  significantly  outperforms  previously  pro¬ 
posed  algorithms. 

2  COLLABORATIVE  FILTERING 
ALGORITHMS 

In  this  section  we  briefly  outline  the  main  ideas  of  col¬ 
laborative  filtering  algorithms  reported  in  the  literature. 
Shardanand  and  Maes,  1995,  discuss  a  variety  of  social 
filtering  algorithms  and  evaluate  them  in  the  context  of 
their  music  recommendation  system  Ringo  (predecessor 
to FireFly). These algorithms are based on a simple intuition:
predictions for a user should be based on the similarity
between the interest profile of that user and those of
other  users.  Therefore,  the  first  step  of  these  algorithms  is 
to  compute  similarities  between  user  profiles.  Suppose  we 
have  a  database  of  user  ratings  for  items,  where  users  in¬ 
dicate  their  interest  in  an  item  on  a  numeric  scale.  It  is 
now  possible  to  define  similarity  measures  between  two 
user  profiles,  U  and  J,  where  a  user  profile  simply  con¬ 
sists  of  a  vector  of  numeric  ratings.  A  measure  proposed 
by Shardanand and Maes is the Pearson correlation coefficient,
$r_{UJ}$. Once the similarity between profiles has been
quantified,  it  can  be  used  to  compute  personalized 
recommendations  for  users.  All  users  whose  similarity  is 
greater  than  a  certain  threshold  t  are  identified  and 
predictions  for  an  item  are  computed  as  the  weighted 
average  of  the  ratings  of  those  similar  users  for  the  item, 
where  the  weight  is  the  computed  similarity.  Note  that 
this  prediction  scheme  leads  to  cases  where  predictions 
cannot  be  computed  for  all  items  in  the  database.  If  the 
threshold  t  is  set  to  a  high  value,  only  a  few  very  similar 
users  are  considered  and  it  becomes  increasingly  likely 
that  ratings  for  some  specific  item  are  not  available.  In 
order  to  avoid  this  problem,  (Resnick  et  al.,  1994) 
compute  predictions  according  to  the  following  formula, 
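where $U_x$ is a rating to be predicted for user U on item x
and $r_{UJ}$ is the correlation between users U and J.

$$U_x = \bar{U} + \frac{\sum_{J \in \mathrm{Raters\ of\ } x} (J_x - \bar{J})\, r_{UJ}}{\sum_{J \in \mathrm{Raters\ of\ } x} |r_{UJ}|}$$

where

$$r_{UJ} = \frac{\sum (U - \bar{U})(J - \bar{J})}{\sqrt{\sum (U - \bar{U})^2 \, \sum (J - \bar{J})^2}}$$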

If  no  ratings  for  item  x  are  available,  the  prediction  is 
equivalent  to  the  mean  of  all  ratings  from  user  U.  Similar 
algorithms  were  reported  and  evaluated  in  (Hill  et  al. 
1995). 
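The correlation-based scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the cited systems: the rating data is hypothetical, the function names `pearson` and `predict` are made up for this sketch, an undefined correlation is treated as zero, and each rater's mean is taken over all of that rater's ratings.

```python
from math import sqrt

def pearson(u, j):
    """Pearson correlation of two rating dicts, computed over co-rated items only."""
    common = [x for x in u if x in j]
    if len(common) < 2:
        return 0.0  # assumption: an undefined correlation counts as 0
    mu_u = sum(u[x] for x in common) / len(common)
    mu_j = sum(j[x] for x in common) / len(common)
    num = sum((u[x] - mu_u) * (j[x] - mu_j) for x in common)
    den = sqrt(sum((u[x] - mu_u) ** 2 for x in common)
               * sum((j[x] - mu_j) ** 2 for x in common))
    return num / den if den else 0.0

def predict(ratings, user, item):
    """User mean plus the correlation-weighted deviations of all raters of the item."""
    u = ratings[user]
    u_mean = sum(u.values()) / len(u)
    num = den = 0.0
    for other, j in ratings.items():
        if other == user or item not in j:
            continue
        r = pearson(u, j)
        j_mean = sum(j.values()) / len(j)  # assumption: mean over all of J's ratings
        num += (j[item] - j_mean) * r
        den += abs(r)
    # With no usable raters the prediction falls back to the user's mean rating.
    return u_mean + num / den if den else u_mean

# Hypothetical ratings, for illustration only.
ratings = {
    "A": {"x": 1, "y": 2, "z": 3},
    "B": {"x": 2, "y": 4, "z": 6, "w": 8},
}
prediction_w = predict(ratings, "A", "w")
prediction_q = predict(ratings, "A", "q")  # nobody rated "q"
```

In this toy data user B is perfectly correlated with user A, so B's above-average rating for item w raises A's predicted rating above A's own mean, while the unrated item q simply receives A's mean.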

While  these  correlation-based  prediction  schemes  were 
shown  to  perform  well,  they  suffer  from  several  limita¬ 
tions.  Here,  we  identify  three  specific  problems:  First, 
correlation  between  two  user  profiles  can  only  be  com¬ 
puted  based  on  items  that  both  users  have  rated,  i.e.  the 
summations  and  averages  in  the  correlation  formula  are 
only  computed  over  those  items  that  both  users  have 
rated.  If  users  can  choose  among  thousands  of  items  to 
rate,  it  is  likely  that  overlap  of  rated  items  between  two 
users  will  be  small  in  many  cases.  Therefore,  many  of  the 
computed  correlation  coefficients  are  based  on  just  a  few 
observations,  and  thus  the  computed  correlation  cannot  be 
regarded  as  a  reliable  measure  of  similarity.  For  example, 
a  correlation  coefficient  based  on  three  observations  has 
as much influence on the final prediction as a coefficient
based on 30 observations. Second, the correlation approach
induces one global model of similarities between
users,  rather  than  separate  models  for  classes  of  ratings 
(e.g.  positive  rating  vs.  negative  rating).  Current  ap¬ 
proaches  measure  whether  two  user  profiles  are  positively 
correlated,  not  correlated  at  all  or  negatively  correlated. 
However,  ratings  given  by  one  user  can  still  be  good  pre¬ 
dictors  for  ratings  of  another  user,  even  if  the  two  user 
profiles  are  not  correlated.  Consider  the  case  where  user 
A’s  positive  ratings  are  a  perfect  predictor  for  a  negative 
rating  from  user  B.  However,  user  A’s  negative  ratings  do 
not  imply  a  positive  rating  from  user  B,  i.e.  the  correlation 
between  the  two  profiles  could  be  close  to  zero,  and  thus 
potentially  useful  information  is  lost.  Third,  and  maybe 
most  importantly,  two  users  can  only  be  similar  if  there  is 
overlap  among  the  rated  items,  i.e.  if  users  did  not  rate 
any  common  items,  their  user  profiles  cannot  be  corre¬ 
lated.  Due  to  the  enormous  number  of  items  available  to 
rate  in  many  domains,  this  seems  to  be  a  serious  stum¬ 
bling  block  for  many  filtering  services,  especially  during 
the  startup  phase.  However,  just  knowing  that  users  did 
not  rate  the  same  items  does  not  necessarily  mean  that 
they  are  not  like-minded.  Consider  the  following  exam¬ 
ple:  Users  A  and  B  are  highly  correlated,  as  are  users  B 
and  C.  This  relationship  provides  information  about  the 
similarity  between  users  A  and  C  as  well.  However,  in 
case  users  A  and  C  did  not  rate  any  common  items,  a  cor¬ 
relation-based  similarity  measure  could  not  detect  any 
relation  between  the  two  users.  We  believe  that  poten¬ 
tially  useful  information  is  lost  if  this  kind  of  transitive 
similarity  relation  cannot  be  detected. 
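The transitive-similarity example can be made concrete with a small sketch. The three rating profiles below are hypothetical, and the helper `corated_pearson` is an illustrative name; it returns `None` whenever fewer than two items are co-rated, which is one possible way to represent a correlation that cannot be computed.

```python
from math import sqrt

def corated_pearson(u, j):
    """Pearson correlation over co-rated items; None if it cannot be computed."""
    common = sorted(set(u) & set(j))
    if len(common) < 2:
        return None
    mu_u = sum(u[x] for x in common) / len(common)
    mu_j = sum(j[x] for x in common) / len(common)
    num = sum((u[x] - mu_u) * (j[x] - mu_j) for x in common)
    den = sqrt(sum((u[x] - mu_u) ** 2 for x in common)
               * sum((j[x] - mu_j) ** 2 for x in common))
    return num / den if den else None

# Hypothetical profiles: A and C each overlap with B, but not with each other.
A = {"i1": 1, "i2": 3, "i3": 5}
B = {"i1": 1, "i2": 3, "i3": 5, "i4": 1, "i5": 3, "i6": 5}
C = {"i4": 1, "i5": 3, "i6": 5}

r_ab = corated_pearson(A, B)
r_bc = corated_pearson(B, C)
r_ac = corated_pearson(A, C)  # no common items, so no similarity estimate
```

Here A and B agree perfectly, as do B and C, yet the measure yields nothing for the pair A and C.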

3  COLLABORATIVE  FILTERING  AS  A 
CLASSIFICATION  PROBLEM 

In  this  section  we  present  collaborative  filtering  in  a  ma¬ 
chine  learning  framework  and  suggest  the  use  of  an  algo¬ 
rithm  that  specifically  addresses  the  aforementioned 
limitations  of  correlation-based  approaches. 

Collaborative  filtering  can  be  seen  as  a  classification  task. 
Based  on  a  set  of  ratings  from  users  for  items,  we  are 
trying  to  induce  a  model  for  each  user  that  allows  us  to 
classify  unseen  items  into  two  or  more  classes,  for  exam¬ 
ple  like  and  dislike.  Alternatively,  if  our  goal  is  to  predict 
user  ratings  on  a  continuous  scale,  we  have  to  solve  a 
regression  problem. 

Our  initial  data  exists  in  the  form  of  a  sparse  matrix, 
where  rows  correspond  to  users,  columns  correspond  to 
items  and  the  matrix  entries  are  ratings.  Note  that  sparse 
in  this  context  means  that  most  elements  of  the  matrix  are 
empty,  because  every  user  typically  rates  only  a  very 
small  subset  of  all  possible  items.  The  prediction  task  can 
now  be  seen  as  filling  in  the  missing  matrix  values.  Since 
we are interested in learning personalized models for each
user, we associate one classifier (or regression model)
with  every  user.  This  model  can  be  used  to  predict  the 
missing  values  for  one  row  in  our  matrix. 


Table  1:  Exemplary  User  Ratings 


h 

h 

I4 

Is 

u, 

4 

3 

u. 

1 

2 

3 

4 

2 

4 

U4 

4 

2 

1 

•? 

With  respect  to  Table  1,  consider  that  we  would  like  to 
predict  user  4’s  rating  for  item  5.  We  can  train  a  learning 
algorithm  with  the  information  that  we  have  about  user 
4’s  previous  ratings.  In  this  example  user  4  has  provided  3 
ratings, which leads to 3 training examples: I1, I2, and I3.
These training examples can be directly represented as
feature vectors, where users correspond to features (U1,
U2, U3) and the matrix entries correspond to feature values.
User 4’s ratings for I1, I2 and I3 are the class labels of
the  training  examples.  However,  in  this  representation  we 
would  have  to  address  the  problem  of  many  missing  fea¬ 
ture  values.  If  the  learning  algorithm  to  be  used  cannot 
handle  missing  feature  values,  we  can  apply  a  simple 
transformation.  Note  that  we  cannot  introduce  an  addi¬ 
tional  numeric  value  that  indicates  a  missing  feature,  be¬ 
cause  this  would  conflate  the  new  value  and  the  observed 
ratings.  However,  every  user  can  be  represented  by  up  to 
n  Boolean  features,  where  n  is  the  number  of  points  on  the 
scale that is used for ratings. For example, if the full n-point
scale of ratings is used to represent ratings from m
users, the resulting Boolean features are of the form “User
m's rating was i”, where 0 < i ≤ n. We can now assign
Boolean  feature  values  to  all  of  these  new  features.  If  this 
representation  leads  to  an  excessive  number  of  features 
that  only  appear  rarely  throughout  the  data,  the  rating 
scale  can  be  further  discretized,  e.g.  into  the  two  classes 
like  and  dislike.  The  resulting  representation  is  simple  and 
intuitive:  a  training  example  E  corresponds  to  an  item  that 
the user has rated, the class label C is the user’s discretized
rating for that item, and items are represented as
vectors of Boolean features F_i.


Table  2:  Exemplary  Feature  Vectors 
Table 2 shows the resulting Boolean feature vectors (true
= 1 and false = 0) for user 4, where a rating of either 1 or
2  corresponds  to  the  class  dislike,  and  a  rating  of  either  3 
or  4  corresponds  to  the  class  like. 
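As a rough illustration of this transformation, the following sketch turns a small ratings dictionary into like/dislike Boolean feature vectors for one target user. The ratings for users U1 to U3 are hypothetical and are not a reconstruction of Table 1; the function name `boolean_examples` and the feature naming scheme are assumptions made for this example.

```python
def boolean_examples(ratings, target, like_threshold=3):
    """Build Boolean training examples for `target` from the other users' ratings.

    ratings: dict mapping user -> dict mapping item -> rating on a 1-4 scale.
    Each other user contributes two features: "<user>_like" and "<user>_dislike".
    A missing rating sets both features to 0.
    """
    others = sorted(u for u in ratings if u != target)
    feature_names = [f"{u}_{c}" for u in others for c in ("like", "dislike")]
    examples = {}
    for item, rating in ratings[target].items():
        vec = []
        for u in others:
            r = ratings[u].get(item)
            vec.append(1 if r is not None and r >= like_threshold else 0)
            vec.append(1 if r is not None and r < like_threshold else 0)
        label = "like" if rating >= like_threshold else "dislike"
        examples[item] = (vec, label)
    return feature_names, examples

# Hypothetical ratings on a 4-point scale (3 or 4 = like, 1 or 2 = dislike).
ratings = {
    "U1": {"I1": 4, "I3": 3},
    "U2": {"I2": 1, "I5": 2},
    "U3": {"I1": 3, "I2": 4, "I3": 2, "I5": 4},
    "U4": {"I1": 4, "I2": 2, "I3": 1},
}
feature_names, examples = boolean_examples(ratings, "U4")
```

Each rated item of the target user becomes one training example, so a user with three ratings yields three labeled Boolean vectors.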

After  converting  a  data  set  of  user  ratings  for  items  into 
this  format,  we  can  draw  on  the  machine  learning  litera¬ 
ture  and  apply  virtually  any  supervised  learning  algorithm 
that, through analysis of a labeled training sample T = {(E_j,
C_j)}, can induce a function f: E → C.

However,  if  we  look  back  at  the  correlation-based  ap¬ 
proaches  described  earlier  and  express  them  in  our  learn¬ 
ing  framework,  we  notice  that  these  algorithms  solve  a 
classification  problem  in  a  somewhat  unconventional 
way.  If  features  and  classes  are  represented  as  ordinal 
values  (no  discretization),  these  algorithms  measure  the 
degree  of  correlation  between  features  and  class  labels. 
Predictions  for  unseen  examples  are  then  computed  as  a 
weighted  average  of  feature  values.  While  this  approach 
seems  to  work  reasonably  well  for  the  domain  at  hand,  it 
is  not  supported  by  a  sound  theory  that  we  could  use  to 
motivate  the  algorithms’  use  for  either  a  classification  or 
regression  task.  It  comes  as  no  surprise  that  researchers  in 
machine  learning  have  thus  far  not  attempted  to  solve  any 
task  with  this  algorithm.  It  seems  likely  that  theoretically 
well-founded  algorithms  that  have  the  discrimination 
between  classes  as  their  specific  goal,  can  outperform 
correlation-based  approaches. 

3.1  REDUCING  DIMENSIONALITY 

Our  goal  is  to  construct  or  apply  algorithms  that  address 
the  previously  identified  limitations  of  correlation-based 
approaches.  As  mentioned  earlier,  the  computation  of 
correlation  coefficients  can  be  based  on  too  little  infor¬ 
mation,  leading  to  inaccurate  similarity  estimates.  When 
applying  a  learning  algorithm,  we  would  like  to  avoid  this 
problem.  In  particular,  we  would  like  to  discard  informa¬ 
tion  that  we  do  not  consider  informative  for  our  classifi¬ 
cation  task.  Likewise,  we  would  like  to  be  able  to  take 
possible  interaction  and  dependencies  among  features  into 
account,  as  we  regard  this  as  an  essential  prerequisite  for 
users  to  become  predictors  for  one  another's  preferences 
even  without  rating  common  items.  Both  of  these  issues 
can  be  addressed  through  the  application  of  appropriate 
feature  extraction  techniques.  Furthermore,  the  need  for 
dimensionality  reduction  is  of  particular  importance  if  we 
represent  our  data  in  the  proposed  learning  framework. 
For  large  databases  containing  many  users  we  will  end  up 
with  thousands  of  features  while  our  amount  of  training 
data  is  very  limited.  Learning  under  these  conditions  is 
not  practical,  because  the  amount  of  data  points  needed  to 
approximate  a  concept  in  d  dimensions  grows  exponen¬ 
tially  with  d,  a  phenomenon  commonly  referred  to  as  the 
curse  of  dimensionality  (Bellman,  1961).  This  is,  of 
course,  not  a  problem  unique  to  collaborative  filtering. 


Other  domains  with  very  similar  requirements  include  the 
classification  of  natural  language  text  or,  in  general,  any 
information  retrieval  task.  In  these  domains  the  similarity 
among  text  documents  needs  to  be  measured.  Ideally,  two 
text  documents  should  be  similar  if  they  discuss  the  same 
subject  or  contain  related  information.  However,  it  is  of¬ 
ten  not  sufficient  to  base  similarity  on  the  overlap  of 
words.  Two  documents  can  very  well  discuss  similar 
subjects,  but  use  a  somewhat  different  vocabulary.  A  low 
number  of  common  words  should  not  imply  that  the 
documents  are  not  related.  This  is  very  similar  to  the 
problem  we  are  facing  in  collaborative  filtering:  the  fact 
that  two  users  rated  different  items  should  not  imply  that 
they are not like-minded. Researchers in information retrieval
have proposed different solutions to the text version
of this problem. One of these approaches, Latent
Semantic Indexing (LSI) (Deerwester et al., 1990), is
based  on  dimensionality  reduction  of  the  initial  data 
through  singular  value  decomposition  (SVD).  We  will 
now  show  how  the  SVD  can  be  used  as  a  dimensionality 
reduction  technique  for  our  collaborative  filtering  task.  A 
more  detailed  description  of  underlying  algebraic  princi¬ 
ples  can  be  found  in  (Berry  et  al.,  1994). 

3.2  COLLABORATIVE  FILTERING  AND  THE 
SVD 

We  start  our  analysis  based  on  a  rectangular  matrix  con¬ 
taining  Boolean  values  that  indicate  user  ratings  for  items 
(see  Table  2).  This  matrix  is  typically  very  sparse,  where 
sparse  means  that  most  elements  are  zero,  because  each 
item  is  only  rated  by  a  small  subset  of  all  users.  Further¬ 
more,  many  features  appear  infrequently  or  do  not  appear 
at  all  throughout  this  matrix.  However,  features  will  only 
affect  the  SVD  if  they  appear  at  least  twice.  Therefore,  we 
apply  a  first  preprocessing  step  and  remove  all  features 
that  appear  less  than  twice  in  our  training  data.  The  result 
of  this  preprocessing  step  is  a  matrix  A  containing  zeros 
and  ones,  with  at  least  two  ones  in  every  row.  Using  the 
SVD,  the  initial  matrix  A  with  r  rows,  c  columns  and  rank 
m  can  be  decomposed  into  the  product  of  three  matrices: 

$A = U \Sigma V^T$

where the columns of U and V are orthonormal vectors
that define the left and right singular vectors of A, and $\Sigma$ is
a diagonal matrix containing corresponding singular
values. Since the derived vectors are orthonormal, no
vector can be reconstructed as a linear combination of the
others. U is an m x c matrix and the singular vectors correspond
to columns of the original matrix. V is an r x m
matrix  and  the  singular  vectors  correspond  to  rows  of  the 
original  matrix.  The  singular  values  quantify  the  amount 
of  variance  in  the  original  data  captured  by  the  singular 
vectors.  This  representation  provides  an  ideal  framework 
for dimensionality reduction, because one can now quantify
the amount of information that is lost if singular values
and their corresponding singular vector elements are
discarded. The smallest singular values are set to zero,
reducing the dimensionality of the new data representation.
The underlying intuition is that the k largest singular
values together with their corresponding singular vector
elements capture the important "latent" structure of the
initial matrix, whereas random fluctuations are eliminated.
The usefulness of the SVD for our task can be further
explained by its geometric interpretation. If we
choose to retain the k largest singular values, we can
interpret the singular vectors, scaled by the singular values,
as coordinates of points representing the rows and columns
of the original matrix in k dimensions. In our context,
the goal of this transformation is to find a spatial
configuration such that items and user ratings are represented
by points in k-dimensional space, where every item
is placed at the centroid of all the user ratings that it
received and every user rating is placed at the centroid of all
the items that it was assigned to. While the position of
vectors in this k-dimensional space is determined through
the assignment of ratings to items, items can still be close
in this space even without containing any common ratings.
Likewise, user ratings can be close to each other,
although they were never assigned to a common set of
items. Many different strategies for classification of items
are theoretically possible using this k-dimensional representation.
We will now describe the complete algorithm
for item classification that we used in our experiments.
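The truncation step can be sketched with NumPy. The Boolean matrix below is hypothetical, and the sketch assumes NumPy's layout for `numpy.linalg.svd`, in which the rows of `U` correspond to rows of A and the rows of `Vt.T` correspond to columns of A; this may differ from the matrix layout described in the text.

```python
import numpy as np

# Hypothetical Boolean matrix: rows are features, columns are rated items.
A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # number of dimensions to retain
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k         # rank-k approximation of A

# The Frobenius-norm error equals the norm of the discarded singular values,
# which is what makes the information loss quantifiable.
error = np.linalg.norm(A - A_k)
lost = np.sqrt(np.sum(s[k:] ** 2))

# Singular vectors scaled by singular values give k-dimensional coordinates.
row_coords = U_k * s_k                  # one point per row of A
col_coords = Vt_k.T * s_k               # one point per column of A
```

Comparing `error` with `lost` shows directly how much of the matrix is given up for a particular choice of k.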

3.3  USING  SINGULAR  VECTORS  AS  TRAINING 
EXAMPLES 

Our  training  data  is  a  set  of  rated  items,  represented  as 
Boolean  feature  vectors  (see  Table  2).  We  compute  the 
SVD of the training data and discard the n smallest singular
values, reducing the dimensionality to k. Currently,
we set k to 0.9 · m, where m is the rank of the initial matrix.
This value was chosen because it resulted in the best
classification  performance  (evaluated  using  a  tuning  set, 
see  Section  4).  The  singular  vectors  of  matrix  U  scaled  by 
the  remaining  singular  values  represent  rated  items  in  k 
dimensions.  These  vectors  become  our  new  training  ex¬ 
amples.  Since  we  compute  the  SVD  of  the  training  data, 
resulting  in  real-valued  feature  vectors  of  size  k,  we  need 
to  specify  how  we  transform  examples  to  be  classified 
into  this  format.  Based  on  the  geometric  interpretation  of 
the  SVD,  the  solution  to  this  problem  is  straightforward. 
We compute a k-dimensional vector for an item, so that
with appropriate rescaling of the axes by the singular values,
it is placed at the centroid of all the user ratings that it
contains. Mathematically, we can compute this vector as:

$v^T U_k \Sigma_k^{-1}$

where v is a Boolean feature vector containing user ratings,
$U_k$ is a matrix of singular vectors with k elements in
each vector, and $\Sigma_k$ is a diagonal matrix containing the k
largest singular values.
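A minimal sketch of this projection, assuming the standard LSI fold-in form $v^T U_k \Sigma_k^{-1}$ and NumPy's SVD layout (rows of `U` correspond to features). The matrix and the new feature vector are hypothetical, and `fold_in` is an illustrative name.

```python
import numpy as np

# Hypothetical Boolean training matrix: rows are features, columns are items.
A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

def fold_in(v, U_k, s_k):
    """Project a Boolean feature vector into the k-dimensional space."""
    # Dividing by s_k applies the inverse of the diagonal matrix of singular values.
    return (v @ U_k) / s_k

# Sanity check: folding in a training item reproduces its coordinates in V.
coords_item0 = fold_in(A[:, 0], U_k, s_k)

# An unseen item is mapped with exactly the same operation.
new_item = np.array([1, 0, 1, 0, 0], dtype=float)
coords_new = fold_in(new_item, U_k, s_k)
```

Because a training item maps back onto its own row of V, new items land in the same coordinate system as the training examples.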

At  this  point  we  need  to  pick  a  suitable  learning  algorithm 
that  takes  real-valued  feature  vectors  as  its  input  and 
learns  a  function  that  either  predicts  class  membership  or 
computes  a  score  a  user  would  assign  to  an  item.  Ideally, 
we  would  like  to  use  a  learning  paradigm  that  allows  for 
maximum  flexibility  in  evaluating  this  task  as  either  a 
regression  or  classification  problem.  Therefore,  we  se¬ 
lected  artificial  neural  networks  as  the  method  of  choice 
for our purposes (Rumelhart and McClelland, 1986). It can
be shown that neural networks with linear output units
and a single hidden layer can approximate any continuous
function f by increasing the size of the hidden layer (Ripley,
1996). This allows us to solve a regression problem.
Alternatively, if we replace the linear output units by logistic
units, we can use the same framework to perform
logistic  regression,  or  learn  to  discriminate  between 
classes.  We  ran  various  experiments  on  a  tuning  set  of  the 
data  available  to  us,  to  determine  a  network  topology  and 
learning  paradigm  that  resulted  in  good  performance  (see 
Section  4  for  details  on  the  experimental  evaluation).  The 
winning  approach  was  a  feed-forward  neural  network 
with  k  input  units,  2  hidden  units  and  1  output  unit.  The 
hidden  units  use  sigmoid  functions,  while  the  output  unit 
is linear. Weights are learned with backpropagation. Although
the task at hand might suggest using a user’s rating
as the function value to predict, we found that a
slightly different approach resulted in better performance.
We determined the average rating for an item¹ and trained
the  network  on  the  difference  between  a  user’s  rating  and 
the  average  rating.  This  function  appeared  to  be  easier  to 
learn,  presumably  because  the  function  values  take  on 
extreme  values  less  frequently  and  in  these  cases  express 
a  user's  individual  taste.  In  order  to  predict  scores  for 
items,  the  output  of  the  network  needs  to  be  added  to  the 
mean  of  the  item.  We  then  used  a  threshold  t  (depending 
on  the  rating  scale  of  the  domain,  see  next  section)  to 
convert  the  predicted  rating  to  a  binary  class  label.  In 
summary,  our  algorithm  for  collaborative  filter  induction 
proceeds  in  the  following  steps: 

Training: 

•  Convert  the  training  data,  a  sparse  matrix  of  user 
ratings,  to  Boolean  feature  vectors,  resulting  in  a  ma¬ 
trix  filled  with  zeros  (false)  and  ones  (true). 

•  Compute  the  SVD  of  the  training  data. 

•  Select  k,  the  number  of  dimensions  to  retain,  and 
reduce  the  extracted  singular  vectors  accordingly. 

•  Train  a  neural  network  with  singular  vectors  scaled 
by  singular  values. 


¹ The average is computed using ratings from all users who rated
the item, except the user whose rating is to be predicted.

Predicting:

•  Convert  the  item’s  user  ratings  to  a  Boolean  feature 
vector. 

•  Scale  the  feature  vector  into  the  k-dimensional  space. 

•  Feed  the  resulting  real-valued  vector  to  the  trained 
neural  network  to  compute  a  prediction. 
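The training and prediction steps above can be sketched end to end as follows. This is a toy illustration under several assumptions rather than the implementation evaluated in the paper: the Boolean matrix, ratings, and item means are hypothetical, the network is a hand-written NumPy version of a k-2-1 feed-forward net trained by full-batch backpropagation, and new items are folded in and then rescaled by the singular values so that they share coordinates with the training examples.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=2, lr=0.1, epochs=500, seed=0):
    """Backpropagation for k inputs, sigmoid hidden units and one linear output."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                 # hidden activations
        err = H @ W2 + b2 - y                    # linear output minus target
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / len(y)               # gradient of the mean squared error
        g_H = np.outer(g_out, W2) * H * (1.0 - H)
        W2 = W2 - lr * (H.T @ g_out)
        b2 = b2 - lr * g_out.sum()
        W1 = W1 - lr * (X.T @ g_H)
        b1 = b1 - lr * g_H.sum(axis=0)
    return (W1, b1, W2, b2), losses

def predict(params, X):
    W1, b1, W2, b2 = params
    return sigmoid(X @ W1 + b1) @ W2 + b2

# Hypothetical data for one user: columns of A are the items this user rated.
A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)
user_ratings = np.array([1.0, 0.2, 0.8, 0.4])
item_means = np.array([0.7, 0.5, 0.6, 0.5])      # hypothetical leave-one-out averages

# Training: SVD, keep k dimensions, train on rating minus item average.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
U_k, s_k = U[:, :k], s[:k]
X_train = Vt[:k, :].T * s_k                      # item vectors scaled by singular values
params, losses = train(X_train, user_ratings - item_means)

# Predicting: fold the new item in, rescale, and add back its average rating.
new_item = np.array([1, 0, 1, 0, 0], dtype=float)
new_item_mean = 0.6                              # hypothetical
x_new = (new_item @ U_k / s_k) * s_k             # assumption: same scaling as training
score = float(predict(params, x_new[None, :])[0]) + new_item_mean
label = "hot" if score > 0.7 else "cold"         # threshold as used in Section 4.2
```

With only four training items the fitted network means little; the point is the order of operations, which mirrors the Training and Predicting lists.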

4  EXPERIMENTAL  EVALUATION 

In  this  section  we  report  results  of  the  experimental 
evaluation  of  our  proposed  algorithm.  We  describe  the 
data  set  used,  the  experimental  methodology,  as  well  as 
performance  measures  we  consider  appropriate  for  this 
task. 

4.1  THE  EACHMOVIE  DATABASE 

We  ran  experiments  using  data  from  the  EachMovie  col¬ 
laborative  filtering  service.  The  EachMovie  service  was 
part  of  a  research  project  at  the  Systems  Research  Center 
of  Digital  Equipment  Corporation.  The  service  was  avail¬ 
able  for  a  period  of  18  months  and  was  shut  down  in 
September  1997.  During  that  time  the  database  grew  to  a 
fairly  large  size,  containing  ratings  from  72,916  users  on 
1,628  movies.  User  ratings  were  recorded  on  a  numeric 
six-point  scale  (0.0,  0.2,  0.4,  0.6,  0.8,  1.0).  The  data  set  is 
publicly  available  and  can  be  obtained  from  Digital 
Equipment  Corporation  (McJones,  1997). 

Although  data  from  72,916  users  is  available,  we  restrict 
our  analysis  to  the  first  2,000  users  in  the  database.  These 
2,000  users  provided  ratings  for  1,410  different  movies. 
We  restricted  the  number  of  users  considered,  because  we 
are  interested  in  the  performance  of  the  algorithm  under 
conditions  where  the  ratio  of  users  to  items  is  low.  This  is 
a  situation  that  every  collaborative  filtering  service  has  to 
go through in its startup phase, and in many domains we
cannot  expect  to  have  significantly  more  users  than  items. 
We  also  believe  that  the  deficiencies  of  correlation-based 
approaches  will  be  more  noticeable  under  these  condi¬ 
tions,  because  it  is  less  likely  to  find  users  with  consider¬ 
able  overlap  of  rated  items. 

4.2  PERFORMANCE  MEASURES 

We  are  most  interested  in  a  system  that  can  accurately 
distinguish  between  movies  a  user  would  like  and  all 
other  movies  rather  than  a  method  that  accurately  predicts 
the  numeric  rating  of  every  movie.  Of  course,  a  method 
that  predicts  the  actual  ratings  most  exactly  could  also  be 
the  best  classifier  for  this  classification  task.  To  analyze 
this,  we  defined  two  classes,  hot  and  cold,  that  were  used 
to  label  movies.  When  transforming  movies  to  training 
examples  for  a  particular  user,  we  label  movies  as  hot  if 


the  rating  for  the  movie  was  0.8  or  1.0,  or  cold  otherwise. 
We  decided  to  use  this  threshold  since  we  are  interested  in 
identifying  movies  the  user  would  like  and  feel  strongly 
about.  Since  the  correlation-based  approaches  as  well  as 
the  neural  network  predict  numeric  ratings,  we  base  the 
classification  of  movies  on  this  numeric  prediction,  and 
classify  them  as  hot  if  the  predicted  rating  exceeds  the 
threshold  0.7  (midpoint  between  the  two  possible  user 
ratings  0.6  and  0.8).  At  the  same  time,  we  can  still  use  the 
predicted  score  to  rank-order  classified  movies.  Not  only 
does  assigning  class  labels  allow  us  to  measure  classifica¬ 
tion  accuracy,  we  can  also  apply  additional  performance 
measures,  such  as  precision  and  recall,  commonly  used 
for  information  retrieval  tasks.  In  our  domain,  precision  is 
the  percentage  of  movies  classified  as  hot  that  are  hot,  and 
recall  is  the  percentage  of  hot  movies  that  were  classified 
as  hot.  We  believe  that  these  measures  are  appropriate  for 
our  study,  because  we  would  like  to  quantify  performance 
for  a  task  that  has  the  identification  of  relevant  items  as  its 
goal. 

It  is  important  to  evaluate  precision  and  recall  in  con¬ 
junction,  because  it  is  easy  to  optimize  either  one  sepa¬ 
rately.  However,  for  a  classifier  to  be  useful  for  our  pur¬ 
poses  we  demand  that  it  be  precise  as  well  as  have  high 
recall.  In  order  to  quantify  this  with  a  single  measure, 
Lewis and Gale (1994) proposed the F-measure, a
weighted  combination  of  precision  and  recall  that  pro¬ 
duces  scores  ranging  from  0  to  1.  Here  we  assign  equal 
importance  to  precision  and  recall: 

$$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

In  summary,  we  measure  the  overall  performance  of  the 
algorithms  using  classification  accuracy  and  the 
F-measure.  Since  we  see  the  F-measure  as  a  useful  con¬ 
struct  to  compare  classifiers,  but  think  that  it  is  not  an 
intuitive  measure  to  indicate  a  user's  perception  of  the 
usefulness  of  an  actual  system,  we  use  an  additional 
measure:  precision  at  the  top  n  ranked  items  (here,  we 
report  scores  for  n  =  5  and  n  =  10). 
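
The measures of this section can be sketched in a few lines of Python. The function names and the toy ratings below are illustrative assumptions and not taken from the experiments; only the labelling rule (hot means an actual rating of 0.8 or 1.0), the prediction threshold of 0.7, and the equal weighting of precision and recall come from the text.

```python
def is_hot(rating):
    # A movie counts as hot if the user's actual rating was 0.8 or 1.0.
    return rating >= 0.8

def evaluate(actual, predicted, threshold=0.7):
    """Return (accuracy, precision, recall, F) for one user's test set."""
    truth = [is_hot(r) for r in actual]
    guess = [p > threshold for p in predicted]
    tp = sum(t and g for t, g in zip(truth, guess))
    correct = sum(t == g for t, g in zip(truth, guess))
    accuracy = correct / len(truth)
    precision = tp / sum(guess) if any(guess) else 0.0
    recall = tp / sum(truth) if any(truth) else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f

def precision_at(actual, predicted, n):
    """Fraction of hot movies among the n highest-scored predictions."""
    ranked = sorted(zip(predicted, actual), key=lambda pa: pa[0], reverse=True)
    top = ranked[:n]
    return sum(is_hot(a) for _, a in top) / len(top)

# Hypothetical ratings for six test movies.
actual = [1.0, 0.8, 0.6, 0.2, 0.8, 0.0]
predicted = [0.9, 0.75, 0.8, 0.1, 0.5, 0.3]
print(evaluate(actual, predicted))
print(precision_at(actual, predicted, 2))
```

Ranking by the raw predicted score, rather than by the thresholded class, is what allows precision at the top n to differ between two classifiers with the same accuracy.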

4.3  EXPERIMENTAL  METHODOLOGY 

Since  we  are  interested  in  the  performance  of  the  algo¬ 
rithms  with  respect  to  the  number  of  ratings  provided  by 
users,  we  report  learning  curves  where  we  vary  the  num¬ 
ber  of  rated  items  from  10  to  50.  For  each  user  we  ran  a 
total  of  30  paired  trials  for  each  algorithm.  For  an  indi¬ 
vidual  trial  of  an  experiment,  we  randomly  selected  50 
rated  items  to  use  as  a  training  set,  and  30  as  a  test  set.  We 
then  started  training  with  10  examples  out  of  the  set  of  50 
and  increased  the  training  set  incrementally  in  steps  of  10, 
measuring  the  algorithms'  performance  on  the  test  set  for 
each  training  set  size.  Final  results  for  one  user  are  then 
averaged over all trials. We repeated this for 20 users and
the final curves reported here are averaged over those 20
users.

Figure 1: Learning Curves

The  actual  size  of  the  feature  vectors  used  to  train  the 
neural  network  depends  on  the  number  of  rated  items  in 
the  current  training  set,  as  well  as  the  particular  rated 
items.  Initially,  every  training  example  consists  of  4000 
Boolean  values  (2000  users  *  2  features  per  user).  Delet¬ 
ing  all  features  that  appear  less  than  twice  reduces  the 
number  of  features  approximately  by  a  factor  of  4  (see 
section  3.2),  i.e.  if  we  start  to  train  our  algorithm  with  10 
examples,  we  have  an  initial  10  x  1000  matrix  of  training 
data.  After  decomposing  this  matrix  using  the  SVD,  the 
matrix  U  that  represents  rated  items  in  a  space  of  lower 
dimensions  is  a  10  x  10  matrix  (because  the  initial  matrix 
has  10  columns  and  this  is  also  the  rank  of  the  matrix). 
Since  we  keep  only  90%  of  the  singular  values,  the  re¬ 
sulting  feature  vectors  consist  of  9  real  values.  Likewise, 
if  we  have  50  examples  in  the  training  set,  the  resulting 
size  of  every  training  example  after  dimensionality  re¬ 
duction  is  45. 
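
The dimension counts above can be checked with a small NumPy sketch. It rests on several assumptions that are not from the paper: the training matrix is replaced by random Boolean data, items are stored as rows, and the reduced representation is taken to be the leading columns of U without rescaling. The function name `reduce_items` is hypothetical.

```python
import numpy as np

def reduce_items(A, keep_percent=90):
    """Map each row of A (one rated item) to a lower-dimensional vector.

    Assumption: the reduced representation is the first k columns of U,
    where k is keep_percent of the numerical rank of A.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(s > 1e-10))          # numerical rank of A
    k = rank * keep_percent // 100         # keep 90% of the singular values
    return U[:, :k]

rng = np.random.default_rng(0)
for n_items in (10, 50):
    # Random stand-in for the n_items x 1000 Boolean training matrix.
    A = (rng.random((n_items, 1000)) < 0.1).astype(float)
    print(n_items, reduce_items(A).shape)
```

With 10 training examples the rank is at most 10, so keeping 90% of the singular values leaves 9 features per item; with 50 examples it leaves 45.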

We  determined  parameters  for  our  algorithms  using  a 
tuning  set  of  20  randomly  selected  users.  The  results  re¬ 
ported  here  are  averaged  over  20  different  users.  The 
training  data  for  these  users  is  based  on  ratings  from  the 
first  2000  users  of  the  database,  as  described  earlier.  We 


selected  users  randomly,  but  with  the  following  con¬ 
straints.  First,  only  users  whose  prior  probability  of  liking 
a  movie  is  below  0.75  are  considered.  Otherwise,  scores 
that  indicate  high  precision  of  our  algorithms  might  be 
biased  by  the  fact  that  there  are  some  users  in  the  database 
who  either  like  everything  or  just  gave  ratings  for  movies 
they  liked.  Second,  only  users  that  rated  at  least  80  mov¬ 
ies  were  selected,  so  that  we  could  use  the  same  number 
of  training  and  test  examples  for  all  users. 
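
The two selection constraints can be expressed as a simple filter. This is a minimal sketch: the function name is hypothetical, and it assumes that the prior probability of liking a movie is estimated as the fraction of a user's ratings that are hot (0.8 or 1.0).

```python
def eligible(ratings, max_prior=0.75, min_rated=80):
    """Decide whether a user qualifies for the evaluation.

    ratings: all ratings of one user on the 0.0 to 1.0 scale.
    """
    if len(ratings) < min_rated:
        return False                       # too few movies for 50 + 30 examples
    prior = sum(r >= 0.8 for r in ratings) / len(ratings)
    return prior < max_prior               # exclude users who like almost everything

# Hypothetical users.
balanced = [1.0] * 50 + [0.2] * 50
enthusiast = [1.0] * 90 + [0.2] * 10
newcomer = [0.8] * 20 + [0.4] * 30
print(eligible(balanced), eligible(enthusiast), eligible(newcomer))
```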

4.4  SUMMARY  OF  RESULTS 

Figure  1  summarizes  the  performance  of  three  different 
algorithms.  The  algorithm  labeled  Correlation  is  the  cor¬ 
relation-based  approach  that  performed  best  on  this  data 
out  of  the  strategies  described  in  Section  2.  This  approach 
uses the prediction formula as described in (Resnick et al.,
1994) and summarized in Section 2. We consider all
correlations, i.e. we do not require correlations to be above a
certain  threshold.  The  algorithm  labeled  SVD/ANN  is  our 
dimensionality  reduction  approach  coupled  with  a  neural 
network  as  described  in  Section  3.3.  Since  this  algorithm 
is  a  combination  of  a  feature  extraction  technique  (SVD) 
and  a  learning  algorithm  (ANN),  the  observed  perform¬ 
ance  does  not  allow  us  to  infer  anything  about  the  relative 
importance  of  each  technique  individually.  Therefore,  we 
report the performance of a third algorithm, labeled
InfoGain/ANN, in order to quantify the importance of our
proposed  feature  extraction  technique.  InfoGain/ANN  uses 
the  same  neural  network  setup  as  SVD/ANN,  but  applies  a 
different  feature  selection  algorithm.  Here,  we  compute 
the  expected  information  gain  (Quinlan,  1986)  of  all  the 
initial  features  and  then  select  the  n  most  informative 
features,  where  n  is  equivalent  to  the  number  of  features 
used  by  SVD/ANN  for  each  training  set  size.  Since  ex¬ 
pected  information  gain  cannot  detect  interaction  and  de¬ 
pendencies  among  features,  the  difference  between 
SVD/ANN  and  InfoGain/ANN  allows  us  to  quantify  the 
utility  of  the  SVD  for  this  task. 
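
The feature selection step of InfoGain/ANN can be illustrated as follows. The sketch assumes Boolean features and Boolean class labels (hot = 1, cold = 0); the function names and the toy data are hypothetical. It scores each feature on its own, which is exactly why this baseline cannot detect interactions among features.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(feature, labels):
    """Expected information gain of one Boolean feature."""
    on = [y for x, y in zip(feature, labels) if x]
    off = [y for x, y in zip(feature, labels) if not x]
    n = len(labels)
    remainder = len(on) / n * entropy(on) + len(off) / n * entropy(off)
    return entropy(labels) - remainder

def select_top(columns, labels, n):
    """Indices of the n features with the highest information gain."""
    gains = [info_gain(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: gains[j], reverse=True)[:n]

# Four training movies, three candidate features (one list per feature).
labels = [1, 1, 0, 0]
columns = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
print(select_top(columns, labels, 2))
```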

The results show that both SVD/ANN and
InfoGain/ANN performed better than the correlation
approach. In addition, SVD/ANN is more accurate and
substantially more precise than InfoGain/ANN. At 50 training
examples  Correlation  reaches  a  classification  accuracy  of 
64.4%,  vs.  67.9%  for  SVD/ANN.  While  predictive  accu¬ 
racy  below  70%  might  initially  seem  disappointing,  we 
need  to  keep  in  mind  that  our  goal  is  not  the  perfect  clas¬ 
sification  of  all  movies.  We  would  like  to  have  a  system 
that  identifies  many  interesting  items  and  does  this  with 
high  precision.  This  ability  is  measured  by  the  F-measure 
and  we  can  see  that  SVD/ANN  has  a  significant  advantage 
over  the  correlation  approach  (at  50  examples  54.2%  for 
Correlation  vs.  68.8%  for  SVD/ANN).  Finally,  if  we  re¬ 
strict  our  analysis  to  the  top  3  or  top  10  suggestions  of 
each  algorithm,  we  can  see  that  SVD/ANN  is  much  more 
precise than the other two algorithms. At 50 training
examples Correlation reaches a precision of 72.6% at the
top 3 suggestions, InfoGain/ANN's precision is 78.3% and
SVD/ANN reaches 83.9%. These results are encouraging
and  provide  empirical  evidence  that  the  use  of  theoreti¬ 
cally  well-founded  learning  algorithms  can  lead  to  im¬ 
proved  predictive  performance  on  collaborative  filtering 
tasks.  Furthermore,  we  have  shown  that  an  additional 
performance  increase  can  be  obtained  through  the  use  of 
appropriate  dimensionality  reduction  techniques,  such  as 
the  SVD. 

5  DISCUSSION  AND  FUTURE  WORK 

Our  experiments  illustrate  the  potential  of  dimensionality 
reduction  techniques  that  exploit  the  underlying  “latent 
structure”  of  user  ratings.  The  key  to  success  of  this 
method  is  that  it  can  utilize  information  from  users  whose 
ratings  are  not  correlated,  or  who  have  not  even  rated 
anything  in  common.  However,  since  we  are  computing 
the  SVD  of  the  training  data,  i.e.  a  matrix  consisting  only 
of  feature  vectors  for  all  items  a  user  has  rated,  we  might 
not  be  exploiting  the  full  potential  of  the  method.  Includ¬ 
ing  feature  vectors  of  items  that  the  user  has  not  rated  in 
the  matrix  to  decompose  will  affect  the  position  of  the 
singular  vectors  corresponding  to  labeled  training  exam¬ 
ples  in  k-dimensional  space.  Future  experiments  will  re¬ 


veal  if  further  performance  improvements  can  be 
achieved  through  the  addition  of  unlabeled  training  data. 

We  believe  that  additional  knowledge  about  the  similarity 
of  users  and  items  can  be  gained  through  the  analysis  of 
textual  descriptions  of  items.  Our  long-term  goal  of  this 
work  is  to  combine  collaborative  and  content-based  fil¬ 
tering  techniques.  Similarity  between  users  could  then  be 
influenced  by  similarity  between  descriptions  of  rated 
items.  This  is  a  very  desirable  characteristic,  as  it  would 
further  reduce  the  need  for  ratings  of  common  items.  We 
believe  that  content-based  techniques  will  fit  nicely  into 
the  learning  framework  presented  in  this  paper.  Since 
items  correspond  to  feature  vectors,  one  could  extend 
these  feature  vectors  to  contain  content-based  features. 
We  started  to  run  initial  experiments  using  textual  de¬ 
scriptions  of  movies,  extending  feature  vectors  with  Boo¬ 
lean  features  indicating  the  presence  or  absence  of  words. 
These  experiments  have  not  yet  led  to  significant  per¬ 
formance  improvements.  However,  we  assume  that  the 
reason  for  this  is  the  form  of  textual  movie  descriptions 
available  to  us  for  these  first  experiments,  rather  than  the 
viability of the method itself.

While  the  proposed  SVD/ANN  approach  leads  to  per¬ 
formance  gains,  it  is  significantly  more  computationally 
expensive  than  the  other  approaches  discussed  here.  The 
SVD implementation used in our experiments is a
single-vector Lanczos method which is part of the publicly
available software package SVDPACKC (Berry, 1992). Its
computational complexity is O(3Dz), where z is the
number of non-zero elements in the matrix and D is the
number of dimensions to be computed. In our experiments we
observed  training  times  (SVD  +  network  training)  ranging 
from  0.4  seconds  for  10  training  examples  to  2.3  seconds 
for 50 training examples.¹ While these times would allow
for  the  application  of  the  algorithm  as  part  of  an  intelli¬ 
gent  information  agent  operating  under  real-time  condi¬ 
tions,  we  need  to  keep  in  mind  that  we  restricted  our  ex¬ 
periments  to  2000  users.  Including  more  users  leads  to 
larger  matrices  to  be  decomposed  and  the  algorithm  will 
slow  down.  Therefore,  it  remains  to  be  seen  if  similar 
techniques  could  be  applied  to  collaborative-filtering 
services  that  have  accumulated  large  amounts  of  data  and 
need  to  compute  predictions  under  real-time  conditions. 
However,  note  that  the  SVD  would  not  have  to  be  recom¬ 
puted  for  each  user.  The  SVD  of  large  portions  of  the 
available  data  could  be  precomputed,  and  new  items  that 
were  not  part  of  this  analysis  could  be  scaled  into  the  k- 
dimensional  space  as  described  in  Section  3.3.  The 
viability,  performance  and  complexity  of  this  approach 
will  be  the  subject  of  future  research. 

¹ Measured on a 200 MHz Pentium Pro system.

6 SUMMARY AND CONCLUSIONS

In  this  paper  we  have  identified  the  shortcomings  of  cor¬ 
relation-based  collaborative  filtering  techniques  and 
shown  how  these  problems  can  be  addressed  through  the 
application  of  classification  algorithms.  We  believe  that 
the  contributions  of  this  paper  are  twofold.  First,  we  have 
presented  a  representation  for  collaborative  filtering  tasks 
that  allows  the  use  of  virtually  any  machine  learning  algo¬ 
rithm.  We  hope  that  this  will  pave  the  way  for  further 
analysis  of  the  suitability  of  learning  algorithms  for  this 
task.  Second,  we  have  shown  that  exploiting  latent  struc¬ 
ture  in  matrices  of  user  ratings  can  lead  to  improved  pre¬ 
dictive  performance.  In  a  set  of  experiments  with  a  data¬ 
base  of  ratings  for  motion  pictures,  we  used  the  singular 
value  decomposition  to  project  user  ratings  and  rated 
items  into  a  lower  dimensional  space.  This  allows  users  to 
become  predictors  for  one  another’s  preferences  even 
without  any  overlap  of  rated  items.  Since  our  society  is 
already  being  characterized  as  an  information  society  that 
suffers  from  steadily  increasing  information  overload,  we 
regard  the  automated  induction  of  personalized  informa¬ 
tion  filters  as  an  important  research  problem.  The  Internet 
opens  up  new  possibilities  to  collect  enormous  amounts  of 
information  about  users’  likes  and  dislikes.  We  hope  this 
paper  will  help  develop  new  ideas  for  more  effective  use 
of  this  information. 
Abstract

An  approach  to  clustering  is  presented  that 
adapts  the  basic  top-down  induction  of  de¬ 
cision  trees  method  towards  clustering.  To 
this  aim,  it  employs  the  principles  of  instance 
based  learning.  The  resulting  methodology 
is  implemented  in  the  TIC  (Top  down  In¬ 
duction  of  Clustering  trees)  system  for  first 
order  clustering.  The  TIC  system  employs 
the  first  order  logical  decision  tree  representa¬ 
tion  of  the  inductive  logic  programming  sys¬ 
tem  Tilde.  Various  experiments  with  TIC 
are  presented,  in  both  propositional  and  re¬ 
lational  domains. 


1  INTRODUCTION 

Decision  trees  are  usually  regarded  as  representing  the¬ 
ories  for  classification.  The  leaves  of  the  tree  contain 
the  classes  and  the  branches  from  the  root  to  a  leaf 
contain  sufficient  conditions  for  classification. 

A  different  viewpoint  is  taken  in  Elements  of  Machine 
Learning  [Langley,  1996].  According  to  Langley,  each 
node  of  a  tree  corresponds  to  a  concept  or  a  cluster, 
and  the  tree  as  a  whole  thus  represents  a  kind  of  taxon¬ 
omy  or  a  hierarchy.  Such  taxonomies  are  not  only  out¬ 
put  by  decision  tree  algorithms  but  typically  also  by 
clustering  algorithms  such  as  e.g.  COBWEB  [Fisher, 
1987].  Therefore,  Langley  views  both  clustering  and 
concept-learning  as  instantiations  of  the  same  general 
technique,  the  induction  of  concept  hierarchies.  The 
similarity  between  classification  trees  and  clustering 
trees  has  also  been  noted  by  Fisher,  who  points  to 
the  possibility  of  using  TDIDT  (or  TDIDT  heuristics) 

* The authors are listed in alphabetical order.


in  the  clustering  context  [Fisher,  1993]  and  mentions 
a  few  clustering  systems  that  work  in  a  TDIDT-like 
fashion  [Fisher  and  Langley,  1985]. 

Following  these  views  we  study  top-down  induction  of 
clustering  trees.  A  clustering  tree  is  a  decision  tree 
where  the  leaves  do  not  contain  classes  and  where 
each node as well as each leaf corresponds to a cluster.
To  induce  clustering  trees,  we  employ  principles  from 
instance  based  learning  and  decision  tree  induction. 
More  specifically,  we  assume  that  a  distance  measure 
is  given  that  computes  the  distance  between  two  exam¬ 
ples.  Furthermore,  in  order  to  compute  the  distance 
between  two  clusters  (i.e.  sets  of  examples),  we  employ 
a function that computes a prototype of a set of
examples. A prototype is then regarded as an example,
which allows us to define the distance between two
clusters as the distance between their prototypes. Given
a  distance  measure  for  clusters  and  the  view  that  each 
node  of  a  tree  corresponds  to  a  cluster,  the  decision 
tree  algorithm  is  then  adapted  to  select  in  each  node 
the  test  that  will  maximize  the  distance  between  the 
resulting  clusters  in  its  subnodes. 
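
The test selection just described can be sketched as follows. This is a propositional illustration under stated assumptions, not the TIC implementation: examples are numeric vectors, the prototype is the component-wise mean, the distance is Euclidean, and tests are plain Python predicates. All names are hypothetical.

```python
import math

def mean_prototype(cluster):
    """Prototype of a cluster of numeric vectors: the component-wise mean."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_test(examples, tests, dist=euclidean, proto=mean_prototype):
    """Return the test whose two subclusters have the most distant prototypes."""
    best, best_d = None, -1.0
    for test in tests:
        yes = [e for e in examples if test(e)]
        no = [e for e in examples if not test(e)]
        if not yes or not no:
            continue                       # this test does not split the cluster
        d = dist(proto(yes), proto(no))
        if d > best_d:
            best, best_d = test, d
    return best, best_d

examples = [[1.0, 0.0], [1.2, 0.1], [5.0, 0.0], [5.2, 0.1]]
tests = [lambda e: e[0] > 3, lambda e: e[1] > 0.05]
chosen, distance = best_test(examples, tests)
print(tests.index(chosen), round(distance, 3))
```

Passing `dist` and `proto` as parameters mirrors the idea that the same induction scheme can be reused with any distance measure and prototype function.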

Depending  on  the  examples  and  the  distance  measure 
employed  one  can  distinguish  two  modes.  In  super¬ 
vised  learning  (as  in  the  classical  top-down  induction 
of  decision  trees  paradigm),  the  distance  measure  only 
takes  into  account  the  class  information  of  each  exam¬ 
ple  (see  e.g.  C4.5  [Quinlan,  1993],  CART  [Breiman  et 
al.,  1984]).  Also,  regression  trees  (SRT  [Kramer,  1996], 
CART)  should  be  considered  supervised  learning.  In 
unsupervised  learning,  the  examples  may  not  be  clas¬ 
sified  and  the  distance  measure  does  not  take  into  ac¬ 
count  any  class  information.  Rather,  all  attributes  or 
features  of  the  examples  are  taken  into  account  in  the 
distance  measure. 

The  Top-down  Induction  of  Clustering  trees  approach 
is  implemented  in  the  TIC  system.  TIC  is  a  first  order 
clustering system as it does not employ the classical
attribute  value  representation  but  that  of  first  order 
logical  decision  trees  as  in  SRT  [Kramer,  1996]  and 
Tilde  [Blockeel  and  De  Raedt,  1998].  So,  the  clusters 
corresponding  to  the  tree  will  have  first  order  defini¬ 
tions.  On  the  other  hand,  in  the  current  implemen¬ 
tation  of  TIC  we  only  employ  propositional  distance 
measures. 

Using  TIC  we  report  on  a  number  of  experiments. 
These  experiments  demonstrate  the  power  of  top-down 
induction  of  clustering  trees.  More  specifically,  we 
show  that  TIC  can  be  used  for  clustering,  for  regres¬ 
sion,  and  for  learning  classifiers. 

This  paper  significantly  expands  on  an  earlier  ex¬ 
tended  abstract  [De  Raedt  and  Blockeel,  1997]  in  that 
TIC  now  contains  a  pruning  method  and  also  that  this 
paper  provides  new  experimental  evidence. 

This  paper  is  structured  as  follows.  In  Section  2  we 
discuss  the  representation  of  the  data  and  the  induced 
theories.  Section  3  identifies  possible  applications  of 
clustering.  The  TIC  system  is  presented  in  Section 
4.  In  Section  5  we  empirically  evaluate  TIC  for  the 
proposed  applications.  Section  6  presents  conclusions 
and  related  work. 

2  THE  LEARNING  PROBLEM 

2.1  REPRESENTING  EXAMPLES 

We  employ  the  learning  from  interpretations  setting 
for  inductive  logic  programming.  For  the  purposes  of 
this  paper,  it  is  sufficient  to  regard  each  example  as  a 
small relational database, i.e. as a set of facts. Within
learning  from  interpretations,  one  may  also  specify 
background  knowledge  in  the  form  of  a  Prolog  pro¬ 
gram  which  can  be  used  to  derive  additional  features 
of the examples.¹ See [De Raedt and Dzeroski, 1994;
De  Raedt,  1996;  De  Raedt  et  al.,  1998]  for  more  details 
on  learning  from  interpretations. 

For  instance,  examples  for  the  well-known  mutage¬ 
nesis  problem  [Srinivasan  et  al.,  1996]  can  be  de¬ 
scribed  by  interpretations.  Here,  an  interpreta¬ 
tion  is  simply  an  enumeration  of  all  the  facts  we 
know  about  one  single  molecule:  its  class,  lumo 
and  logp  values,  the  atoms  and  bonds  occurring 
in it, certain high-level structures, ... We can represent
it e.g. as follows: {logmutag(-0.7), neg,
lumo(-3.025), logp(2.29), atom(d189_1,c,22,-0.11),
atom(d189_2,c,22,-0.11), bond(d189_1,d189_2,7),
bond(d189_2,d189_3,7), ...}

Figure 1: A clustering tree (root test: atom(X,Y,14,Z)?)

¹ The interpretation corresponding to each example e is
then the minimal Herbrand model of B ∧ e.

2.2  FIRST  ORDER  LOGICAL  DECISION 
TREES 

First  order  logical  decision  trees  are  similar  to  stan¬ 
dard  decision  trees,  except  that  the  test  in  each  node 
is a conjunction of literals instead of a test on an
attribute. They are always binary, as the test can only
succeed  or  fail.  A  detailed  discussion  of  these  trees  is 
beyond  the  scope  of  this  paper  but  can  be  found  in 
[Blockeel  and  De  Raedt,  1998].  We  will  use  these  trees 
to  represent  clustering  trees. 

An  example  of  a  clustering  tree,  in  the  mutagenesis 
context,  is  shown  in  Figure  1.  Note  that  in  a  classical 
logical  decision  tree  leaves  would  contain  classes.  Here, 
leaves  simply  contain  sets  of  examples  that  belong  to¬ 
gether.  Also  note  that  variables  occurring  in  tests  are 
existentially  quantified.  The  root  test,  for  instance, 
tests  whether  there  occurs  an  atom  of  type  14  in  the 
molecule.  The  whole  set  of  examples  is  thus  divided 
into  two  clusters:  a  cluster  of  molecules  containing  an 
atom  14  and  a  cluster  of  molecules  not  containing  any. 
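
To make the existential reading of a node test concrete, the following sketch represents each example as a set of ground facts and evaluates the root test atom(X,Y,14,Z)? against it. The tuple encoding, the atom identifiers and the function names are assumptions made for illustration; Tilde and TIC evaluate such tests as Prolog queries.

```python
def has_atom_of_type(example, atom_type):
    """True if some fact atom(X, Y, atom_type, Z) occurs in the example.

    X, Y and Z are existentially quantified: any binding will do.
    """
    return any(f[0] == "atom" and f[3] == atom_type for f in example)

def split(examples, test):
    """Divide a cluster into the examples passing and failing the test."""
    yes = [e for e in examples if test(e)]
    no = [e for e in examples if not test(e)]
    return yes, no

# Two hypothetical molecules, each an interpretation (set of facts).
m1 = {("lumo", -3.025), ("atom", "m1_1", "c", 22, -0.11),
      ("atom", "m1_2", "c", 14, 0.05), ("bond", "m1_1", "m1_2", 7)}
m2 = {("lumo", -1.4), ("atom", "m2_1", "c", 22, -0.11),
      ("atom", "m2_2", "o", 40, -0.38), ("bond", "m2_1", "m2_2", 1)}

with_14, without_14 = split([m1, m2], lambda e: has_atom_of_type(e, 14))
print(len(with_14), len(without_14))
```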

This  view  is  in  correspondence  with  Langley’s  view¬ 
point  that  a  test  in  a  node  is  not  just  a  decision  crite¬ 
rion,  but  also  a  description  of  the  subclusters  formed 
in  this  node.  In  [Blockeel  and  De  Raedt,  1998]  we 
show  how  a  logical  decision  tree  can  be  transformed 
into  an  equivalent  logic  program,  which  could  alterna¬ 
tively  be  used  to  sort  examples  into  clusters.  The  logic 
program  contains  invented  predicates  that  correspond 
to  the  clusters. 

2.3  INSTANCE  BASED  LEARNING  AND 
DISTANCES 

The  purpose  of  conceptual  clustering  is  to  obtain  clus¬ 
ters  such  that  intra-cluster  distance  (i.e.  the  distance 
between  examples  belonging  to  the  same  cluster)  is 
as  small  as  possible  and  the  inter-cluster  distance  (i.e. 
the distance between examples belonging to different
clusters)  is  as  large  as  possible. 

In  this  paper,  we  assume  that  a  distance  measure  d 
that computes the distance $d(e_1, e_2)$ between
examples $e_1$ and $e_2$ is given. Furthermore, there is also a
need for measuring the distance between different
clusters (i.e. between sets of examples). Therefore we will
also assume the existence of a prototype function $p$
that computes the prototype $p(E)$ of a set of examples
$E$. The distance between two clusters $C_1$ and $C_2$ is
then defined as the distance $d(p(C_1), p(C_2))$ between
the  prototypes  of  the  clusters.  This  shows  that  the 
prototypes  should  be  considered  as  (possibly)  partial 
example  descriptions.  The  prototypes  should  be  suf¬ 
ficiently  detailed  as  to  allow  the  computation  of  the 
distances. 

For instance, the distance could be the Euclidean
distance $d_1$ between the values of one or more numerical
attributes, or it could be the distance $d_2$ as measured
by a first order distance measure such as used in RIBL
[Emde and Wettschereck, 1996], KBG [Bisson, 1992]
or [Hutchinson, 1997].

Given the distance at the level of the examples, the
principles of instance based learning can be used to
compute the prototypes. E.g. $d_1$ would result in a
prototype function $p_1$ that would simply compute the
mean for the cluster, whereas $d_2$ could result in a
function $p_2$ that would compute the (possibly reduced)
least general generalisation² of the examples in the
cluster.

Throughout  this  paper  we  employ  only  propositional 
distance  measures  and  the  prototype  functions  that 
correspond  to  the  instance  averaging  methods  along 
the  lines  of  [Langley,  1996].  However,  we  stress  that  - 
in  principle  -  we  could  use  any  distance  measure.  No¬ 
tice  that  although  we  employ  only  propositional  dis¬ 
tance  measures,  we  obtain  first  order  descriptions  of 
the  clusters  through  the  representation  of  first  order 
logical  decision  trees. 

2.4  PROBLEM-SPECIFICATION 

By  now  we  are  able  to  formally  specify  the  clustering 
problem: 

Given 


² Using Plotkin's [1970] notion of θ-subsumption or
the variants corresponding to structural matching [Bisson,
1992; De Raedt et al., 1997].


•  a  set  of  examples  E  (each  example  is  a  set  of  tuples 
in  a  relational  database  or  equivalently,  a  set  of 
facts  in  Prolog), 

•  a  background  theory  B  in  the  form  of  a  Prolog 
program, 

•  a  distance  measure  d  that  computes  the  distance 
between  two  examples  or  prototypes, 

•  a  prototype  function  p  that  computes  the  proto¬ 
type  of  a  set  of  examples. 

Find:  a  first  order  clustering  tree. 

Before  discussing  how  this  problem  can  be  solved  we 
take  a  look  at  possible  applications  of  clustering  trees. 

3  APPLICATIONS  OF 
CLUSTERING  TREES 

Following  Langley’s  viewpoint,  a  system  such  as  C4.5 
can  be  considered  a  supervised  clustering  system 
where  the  “distance”  metric  is  the  class  entropy  within 
the clusters: lower class entropy within a cluster
means  that  the  examples  in  that  cluster  are  more  sim¬ 
ilar  with  respect  to  their  classes.  Since  C4.5  employs 
class  information,  it  is  a  supervised  learner. 

Clustering can also be done in an unsupervised manner,
however. When making use of a distance metric to
form  clusters,  this  distance  metric  may  or  may  not  use 
information  about  the  classes  of  the  examples.  Even 
if  it  does  not  use  class  information,  clusters  may  be 
coherent  with  respect  to  the  class  of  the  examples  in 
them. 

This  principle  leads  to  a  classification  technique  that 
is  very  robust  with  respect  to  missing  class  informa¬ 
tion.  Indeed,  even  if  only  a  small  percentage  of  the 
examples  is  labelled  with  a  class,  one  could  perform 
unsupervised  clustering,  and  assign  to  each  leaf  in  the 
concept  hierarchy  the  majority  class  in  that  leaf.  If 
the  leaves  are  coherent  with  respect  to  classes,  this 
method  would  yield  relatively  high  classification  accu¬ 
racy  with  a  minimum  of  class  information  available. 
This  is  quite  similar  in  spirit  to  Emde’s  method  for 
learning  from  few  classified  examples,  implemented  in 
the  COLA  system  [Emde,  1994]. 
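
The labelling step of this technique is easy to sketch. The code assumes that clustering has already been performed and that each leaf is given as a list of class labels, with `None` standing for an example whose class is unknown; this encoding and the function name are hypothetical.

```python
from collections import Counter

def label_leaves(leaves):
    """Assign to each leaf the majority class among its labelled examples.

    A leaf without any labelled example receives None.
    """
    labels = []
    for leaf in leaves:
        known = [c for c in leaf if c is not None]
        labels.append(Counter(known).most_common(1)[0][0] if known else None)
    return labels

# Three leaves of a clustering tree; most class labels are missing.
leaves = [["pos", None, None, "pos", "neg"],
          [None, "neg", None],
          [None, None]]
print(label_leaves(leaves))
```

An unseen example would then be sorted down the tree and receive the label of the leaf it ends up in.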

A  similar  reasoning  can  be  followed  for  regression, 
leading  to  “unsupervised  regression”;  again  this  may 
be  useful  in  the  case  of  partially  missing  information. 
We conclude that clustering can extend classification
and  regression  towards  unsupervised  learning.  An¬ 
other  extension  in  the  predictive  context  is  that  clus¬ 
ters  can  be  used  to  predict  many  or  all  attributes  of 
an  example  at  once. 

Depending  on  the  application  one  has  in  mind,  mea¬ 
suring  the  quality  of  a  clustering  tree  is  done  in  differ¬ 
ent  ways.  For  classification  purposes  predictive  accu¬ 
racy  on  unseen  cases  is  typically  used.  For  regression 
an  often  used  criterion  is  the  relative  error,  which  is 
the  mean  squared  error  of  predictions  divided  by  the 
mean  squared  error  of  a  default  hypothesis  always  pre¬ 
dicting  the  mean.  This  can  be  extended  towards  the 
clustering  context  if  a  distance  measure  and  prototype 
function  are  available: 


$$RE = \frac{\sum_{i=1}^{n} d(e_i, \hat{e}_i)^2}{\sum_{i=1}^{n} d(e_i, p)^2}$$

with $e_i$ the examples, $\hat{e}_i$ the predictions and $p$ the
prototype. (A prediction is, just like a prototype, a
partial example description that is sufficiently detailed to
allow the computation of a distance.)
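
As an illustration, the relative error can be computed as the sum of squared distances between examples and their predictions, divided by the sum of squared distances between examples and the overall prototype. The sketch below assumes scalar examples with the absolute difference as distance; the function name and the data are hypothetical.

```python
def relative_error(examples, predictions, prototype, dist):
    """Squared prediction error relative to always predicting the prototype."""
    num = sum(dist(e, pred) ** 2 for e, pred in zip(examples, predictions))
    den = sum(dist(e, prototype) ** 2 for e in examples)
    return num / den

def abs_dist(a, b):
    return abs(a - b)

examples = [1.0, 2.0, 3.0, 6.0]
prototype = sum(examples) / len(examples)      # mean of all examples: 3.0
predictions = [1.5, 1.5, 4.5, 4.5]             # e.g. prototypes of two leaves
print(relative_error(examples, predictions, prototype, abs_dist))
```

A value of 1 means the tree predicts no better than the default hypothesis; values below 1 indicate an improvement.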

If  clustering  is  considered  as  unsupervised  learning  of 
classification  or  regression  trees,  the  relative  error  of 
only  the  predicted  variable  or  the  accuracy  with  which 
the  class  variable  can  be  predicted  is  a  suitable  quality 
criterion.  In  this  case  classes  should  be  available  for 
the  evaluation  of  the  clustering  tree,  though  not  during 
(unsupervised)  learning.  Such  an  evaluation  is  often 
done  for  clusters,  see  e.g.  [Fisher,  1987]. 


4 TIC: TOP-DOWN INDUCTION
OF CLUSTERING TREES

A system for top-down induction of clustering trees
called TIC has been implemented as a subsystem of
the ILP system Tilde [Blockeel and De Raedt, 1998].
TIC employs the basic TDIDT framework as it is also
incorporated in the Tilde system. The main point
where TIC and Tilde differ from the propositional
TDIDT algorithm is in the computation of the (first
order) tests to be placed in a node; see [Blockeel and
De Raedt, 1998] for details. Furthermore, TIC differs
from Tilde in that it uses other heuristics for splitting
nodes, an alternative stopping criterion and alternative
tree post-pruning methods. We discuss these
topics below.

4.1 SPLITTING

The splitting criterion used in TIC works as follows.
Given a cluster $C$ and a test $T$ that will result in two
disjoint subclusters $C_1$ and $C_2$ of $C$, TIC computes
the distance $d(p(C_1), p(C_2))$, where $p$ is the prototype
function. The best test $T$ is then the one that maximizes
this distance. This reflects the principle that the
inter-cluster distance should be as large as possible.

If the prototype is simply the mean, then maximizing
inter-cluster distances corresponds to minimizing
intra-cluster distances, and splitting heuristics such
as information gain [Quinlan, 1993] or Gini index
[Breiman et al., 1984] can be seen as special cases
of the above principle, as they minimize intra-cluster
class diversity. In the regression context, minimizing
intra-cluster variance (e.g. [Kramer, 1996]) is another
instance of this principle.

Note that our distance-based approach has the advantage
of being applicable to both numeric and symbolic
data, and thus generalises over regression and
classification.

4.2 STOPPING CRITERIA

Stopping criteria are often based on significance tests.
In the classification context a $\chi^2$-test is often used to
check whether the class distributions in the subtrees
differ significantly [Clark and Niblett, 1989; De Raedt
and Van Laer, 1995]. Since regression and clustering
use variance as a heuristic for choosing the best split,
a reasonable heuristic for the stopping criterion seems
to be the F-test. If a set of examples is split into two
subsets, the variance should decrease significantly, i.e.


$$F = \frac{SS/(n-1)}{(SS_L + SS_R)/(n-2)}$$

should be significantly large ($SS$ is the sum of squared
differences from the mean inside the set of examples,
$SS_L$ and $SS_R$ are the same for the two created subsets
of the examples, and $n$ is the total number of examples).*
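The splitting criterion of Section 4.1 and the F-statistic of Section 4.2 can be illustrated with a small sketch. It is not TIC's implementation: it assumes purely numeric examples, the mean as prototype and Euclidean distance as $d$, and the names `prototype`, `distance`, `best_test` and `f_statistic` are hypothetical. Comparing the statistic against a critical value of the F-distribution is left out.

```python
import math

def prototype(cluster):
    """Prototype p(C): here simply the attribute-wise mean (an assumption)."""
    n = len(cluster)
    return tuple(sum(col) / n for col in zip(*cluster))

def distance(a, b):
    """Euclidean distance between two numeric tuples (an assumed choice of d)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split(cluster, test):
    """Partition a cluster into the examples that pass and fail a boolean test."""
    passed = [e for e in cluster if test(e)]
    failed = [e for e in cluster if not test(e)]
    return passed, failed

def best_test(cluster, tests):
    """Return (test, distance) maximising d(p(C1), p(C2)) over candidate tests."""
    best, best_d = None, -1.0
    for test in tests:
        c1, c2 = split(cluster, test)
        if c1 and c2:  # ignore tests that leave one side empty
            d = distance(prototype(c1), prototype(c2))
            if d > best_d:
                best, best_d = test, d
    return best, best_d

def sum_of_squares(cluster):
    """SS: sum of squared distances of the examples to their mean."""
    p = prototype(cluster)
    return sum(distance(e, p) ** 2 for e in cluster)

def f_statistic(cluster, c1, c2):
    """F = (SS/(n-1)) / ((SS_L + SS_R)/(n-2)); infinite if the split is perfect."""
    n = len(cluster)
    within = sum_of_squares(c1) + sum_of_squares(c2)
    if within == 0:
        return math.inf
    return (sum_of_squares(cluster) / (n - 1)) / (within / (n - 2))

# Hypothetical one-attribute data and two candidate tests.
data = [(1.0,), (2.0,), (10.0,), (11.0,)]
tests = [lambda e: e[0] < 5, lambda e: e[0] < 1.5]
chosen, d = best_test(data, tests)
left, right = split(data, chosen)
f_value = f_statistic(data, left, right)
```

In this toy case the first test separates the two obvious groups, so it yields the larger prototype distance and a large F value; a branch would stop growing when the best available split gives a value that is not significantly large.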

4.3  PRUNING  USING  A  VALIDATION 
SET 

The principle of using a validation set to prune trees
is very simple. After using the training set to build a
tree, the quality of the tree is computed on the validation
set (predictive accuracy for classification trees, inverse
of relative error for regression or clustering trees).
For each node of the tree, the quality $Q'$ of the tree if it
were pruned at that node is compared with the
quality $Q$ of the unpruned tree. If $Q' > Q$, the
tree is pruned at that node.

*The F-test is only theoretically correct for normally
distributed populations. Since this assumption may not
hold, it should here be considered a heuristic for deciding
when to stop growing a branch, not a real statistical test.

Such  a  strategy  has  been  successfully  followed  in  the 
context  of  classification  and  regression  (e.g.  CART 
[Breiman  et  al.,  1984])  as  well  as  clustering  (e.g. 
[Fisher,  1996]).  Fisher’s  method  is  more  complex  than 
ours  in  that  for  each  individual  variable  a  different 
subset  of  the  original  tree  will  be  used  for  prediction. 

In the current implementation of Tilde, validation-set-based
pruning is available for all settings. For clustering
and regression it is the only pruning criterion
that is implemented. It is only reliable for reasonably
large data sets, though. When learning from small data
sets, performance decreases because the training set becomes
even smaller, and with a small validation set a
lot of pruning is due to random influences.
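The pruning principle of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the Tilde/TIC code: the `Node` class, the order in which nodes are tried, and the repeated rescanning after each successful prune are all hypothetical choices; the text only fixes the rule that a node is pruned when the resulting quality $Q'$ exceeds $Q$. Quality is taken to be accuracy here; for regression or clustering trees the inverse of relative error would be passed in instead.

```python
class Node:
    def __init__(self, prediction, test=None, yes=None, no=None):
        self.prediction = prediction  # value predicted if this node is a leaf
        self.test, self.yes, self.no = test, yes, no

    def is_leaf(self):
        return self.test is None

def predict(node, x):
    """Sort x down the tree and return the prediction of the leaf reached."""
    while not node.is_leaf():
        node = node.yes if node.test(x) else node.no
    return node.prediction

def accuracy(tree, data):
    """Fraction of (x, y) pairs for which the tree predicts y."""
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    """Yield all non-leaf nodes, parents before children."""
    if not node.is_leaf():
        yield node
        yield from internal_nodes(node.yes)
        yield from internal_nodes(node.no)

def prune(tree, validation, quality=accuracy):
    """Turn a node into a leaf whenever that strictly raises validation quality."""
    improved = True
    while improved:
        improved = False
        q = quality(tree, validation)
        for node in list(internal_nodes(tree)):
            saved = (node.test, node.yes, node.no)
            node.test = node.yes = node.no = None  # tentatively prune here
            if quality(tree, validation) > q:      # Q' > Q: keep the prune
                improved = True
                break
            node.test, node.yes, node.no = saved   # otherwise undo
    return tree

# Hypothetical tree whose inner split does not help on the validation set.
inner = Node("a", test=lambda x: x < 2, yes=Node("a"), no=Node("b"))
tree = Node("b", test=lambda x: x < 5, yes=inner, no=Node("b"))
validation = [(1, "a"), (3, "a"), (4, "a"), (7, "b"), (8, "b")]
prune(tree, validation)
```

In this example the inner node is replaced by a leaf while the root split is kept, because only the former improves accuracy on the validation examples.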

5  EXPERIMENTS 

5.1  DATA  SETS 

We  used  the  following  data  sets  for  our  experiments: 

•  Soybeans: this database [Michalski and Chilausky,
1980] contains descriptions of diseased soybean
plants. Every plant is described by 35 attributes.
A small data set (46 examples, 4 classes)
and a large one (307 examples, 19 classes) are
available at the UCI repository [Merz and Murphy,
1996].

•  Iris:  a  simple  database  of  descriptions  of  iris 
plants,  available  at  the  UCI  repository.  It  con¬ 
tains  3  classes  of  50  examples  each.  There  are  4 
numerical  attributes. 

•  Mutagenesis: this database [Srinivasan et al.,
1996] contains descriptions of molecules for which
the mutagenic activity has to be predicted. Originally
mutagenicity was measured by a real number,
but in most experiments with ILP systems
this has been discretized into two values (positive
and negative). The database is available at the
ILP repository [Kazakov et al., 1996].

Srinivasan et al. [1995] introduce four levels of
background knowledge; the first 2 contain only
structural information (atoms and bonds in the
molecules), the other 2 contain higher level information
(attributes describing the molecule as a
whole and higher level submolecular structures).
For our experiments the tests allowed in the
trees can make use of structural information only
(Background 2), though for the heuristics numerical
information from Background 3 can be used.

•  Biodegradability:  a  set  of  62  molecules  of  which 
structural  descriptions  and  molecular  weights  are 
given.  The  biodegradability  of  the  molecules  is  to 
be  predicted.  This  is  a  real  number,  but  has  been 
discretized  into  four  values  (fast,  moderate,  slow, 
resistant)  in  most  past  experiments.  The  dataset 
was  provided  to  us  by  S.  Dzeroski  but  is  not  yet 
in  the  public  domain. 

The  data  sets  were  deliberately  chosen  to  include  both 
propositional  and  relational  data  sets.  For  each  indi¬ 
vidual  experiment  the  most  suitable  data  sets  were 
chosen  (w.r.t.  size,  suitability  for  a  specific  task,  and 
relevant  results  published  in  the  literature). 

Distances  were  always  computed  from  all  numerical 
attributes,  except  when  stated  otherwise.  For  the  Soy¬ 
beans  data  sets  all  nominal  attributes  were  converted 
into  numbers  first. 

5.2  EXPERIMENT  1:  PRUNING 

In  this  first  experiment  we  want  to  evaluate  the  effect 
of  pruning  in  TIC  on  both  predictive  accuracy  and  tree 
complexity.  We  have  applied  TIC  to  two  databases: 
Soybeans  (large  version)  and  Mutagenesis.  We  chose 
these  two  because  they  are  relatively  large  (as  noted 
before,  the  pruning  strategy  is  prone  to  random  influ¬ 
ences  when  used  with  small  datasets). 

For both data sets tenfold crossvalidations were performed.
In each run the algorithm divides the learning
set into a training set and a validation set. Clustering
trees are built and pruned in an unsupervised manner.
The clustering hierarchy before and after pruning is
evaluated by predicting the class of each test example.

In  Figure  2,  the  average  accuracy  of  the  clustering  hi¬ 
erarchies  before  and  after  pruning  is  plotted  against 
the  size  of  the  validation  set  (this  size  is  a  parameter 
of  TIC),  and  the  same  is  done  for  the  tree  complex¬ 
ity.  The  same  results  for  the  Mutagenesis  database  are 
summarised  in  Figure  3. 

From the Soybeans experiment it can be concluded
that TIC’s pruning method results in a slight decrease
in accuracy but a large decrease in the number of
nodes. The pruning strategy seems relatively stable
w.r.t. the size of the validation set. The Mutagenesis
experiment confirms these findings (though the
decrease in accuracy is less clear here).

Figure 2: Soybeans: a) accuracy before and after
pruning; b) number of nodes before and after pruning.

5.3  EXPERIMENT  2:  COMPARISON 
WITH  OTHER  LEARNERS 

In this experiment we compare TIC with propositional
clustering systems and with classification and regression
systems. A comparison with propositional clustering
systems is hard to make because few quantitative
results are available in the literature; therefore we also
compare with supervised learners.

We  applied  TIC  to  the  Soybean  (small)  and  Iris 
databases,  performing  tenfold  crossvalidations.  Learn¬ 
ing  is  unsupervised,  but  classes  are  assumed  to  be 
known  at  evaluation  time  (the  class  of  a  test  exam¬ 
ple  is  compared  with  the  majority  class  of  the  leaf 
the  example  is  sorted  into).  Table  1  compares  the  re¬ 
sults  with  those  obtained  with  the  supervised  learner 
Tilde. 

We see that TIC obtains high accuracies for these
problems. The only clustering result we know of is
for COBWEB, which obtained 100% on the Soybean
dataset. This difference is not significant. Tilde’s accuracies
don’t differ much from those of TIC, which induced
the hierarchy without knowledge of the classes.
Tree sizes are smaller though.

Figure 3: Mutagenesis: accuracy and size of the clustering
trees.

            TIC                   Tilde
Database    acc.   tree size      acc.   tree size
Soybean     97%    3.9 nodes      100%   3 nodes
Iris        92%    15 nodes       94%    4 nodes

Table 1: Comparison of TIC with a supervised learner
(averages over 10-fold crossvalidation).

We  have  also  performed  an  experiment  on  the 
Biodegradability  data  set,  predicting  numbers.  For 
this  dataset  the  F-test  stopping  criterion  was  used  (sig¬ 
nificance  level  0.01),  but  no  validation  set  was  used 
given  the  small  size  of  the  data  set.  The  distance  used 
is  the  difference  between  class  values.  Table  2  com¬ 
pares  TIC’s  performance  with  Tilde’s  (classification, 
leave-one-out)  and  SRT’s  (regression,  sixfold). 

Our conclusions are that a) for unsupervised learning
TIC performs almost as well as other unsupervised or
supervised learners, if classification accuracy is measured;
and b) while there is clearly room for improvement
with respect to using TIC for regression, post-discretization
of the regression predictions shows that
this approach is competitive with classical approaches
to classification.
l.o.o.   Tilde   classification            acc. = 0.532
l.o.o.   TIC     regression                RE = 0.740
l.o.o.   TIC     classif. via regression   acc. = 0.565
6-fold   SRT     regression                RE = 0.34
6-fold   TIC     regression                RE = 1.13

Table 2: Comparison of regression and classification
on the biodegradability data (l.o.o. = leave-one-out).


5.4  EXPERIMENT  3:  PREDICTING 
MULTIPLE  ATTRIBUTES 

Clustering makes it possible to predict multiple attributes.
Since examples in a leaf must resemble each other as much
as possible, their attributes must also agree as much as
possible.

By  sorting  unseen  examples  down  a  cluster  tree  and 
comparing  all  attributes  of  the  example  with  the  pro¬ 
totype  attributes,  we  get  an  idea  of  how  good  the  tree 
is.  This  is  an  extension  of  the  classical  evaluation,  as 
each  attribute  in  turn  is  a  class  now. 

We  did  a  tenfold  crossvalidation  for  the  following  ex¬ 
periment:  using  the  training  set  a  clustering  tree  is 
induced.  Then,  all  examples  of  the  test  set  are  sorted 
in  this  hierarchy,  and  the  prediction  for  all  of  their 
attributes  is  evaluated.  For  each  attribute,  the  value 
that  occurs  most  frequently  in  a  leaf  is  predicted  for 
all  test  examples  sorted  in  that  leaf. 
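The evaluation just described can be sketched in a few lines. The sketch assumes that examples are tuples of nominal attribute values and that the sorting of examples into leaves has already been done; the mappings `train_leaves` and `test_leaves` and the function names are hypothetical, not part of TIC.

```python
from collections import Counter

def leaf_prototype(examples):
    """Most frequent value of every attribute among the examples in one leaf."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*examples))

def attribute_accuracies(train_leaves, test_leaves):
    """Per-attribute accuracy of predicting the leaf's most frequent value.

    Both arguments map a leaf identifier to the examples sorted into that leaf.
    """
    n_attr = len(next(iter(train_leaves.values()))[0])
    hits, total = [0] * n_attr, 0
    for leaf, tests in test_leaves.items():
        proto = leaf_prototype(train_leaves[leaf])
        for example in tests:
            total += 1
            for i, (predicted, actual) in enumerate(zip(proto, example)):
                hits[i] += (predicted == actual)
    return [h / total for h in hits]

# Hypothetical two-leaf tree with two attributes per example.
train_leaves = {"L1": [(0, 1), (0, 1), (1, 1)], "L2": [(2, 0), (2, 0)]}
test_leaves = {"L1": [(0, 1), (1, 1)], "L2": [(2, 0)]}
acc = attribute_accuracies(train_leaves, test_leaves)
```

Averaging the returned list gives a single figure comparable to the mean accuracy reported for this experiment.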

We used the large soybean database, with pruning.
Table 3 summarizes the accuracies obtained for each
attribute and compares them with the accuracy of majority
prediction. The high accuracies show that most
attributes can be predicted very well, which means
the clusters are very coherent. The mean accuracy of
81.6% does not differ significantly from the 83 ± 2%
reported in [Fisher, 1996].

5.5  EXPERIMENT  4:  HANDLING 
MISSING  INFORMATION 

It  can  be  expected  that  clustering,  making  use  of  more 
attributes  than  just  class  attributes,  is  more  robust 
with  respect  to  missing  values.  We  showed  in  Experi¬ 
ment  2  that  unsupervised  learners  (where  the  heuris¬ 
tics  do  not  use  any  class  information  at  all)  can  yield 
trees  with  predictive  accuracies  close  to  those  of  su¬ 
pervised  learners,  but  all  class  information  was  still 
available  for  assigning  classes  to  leaves  after  the  tree 
was  built. 

In this experiment, we measure the predictive accuracy
of trees when class information as well as other
information may be missing, not only for learning, but
also for assigning classes to leaves afterwards, and this
for several levels of missing information. Our aim is to
investigate how predictive accuracy deteriorates with
missing information, and to compare clustering systems
that use only class information with systems that
use more information.

name              range   default   acc.
date              0-6     21.2%     46.3%
plant-stand       0-1     52.1%     85.0%
precip            0-2     68.4%     79.2%
temp              0-2     58.3%     75.6%
hail              0-1     68.7%     71.3%
crop-hist         0-3     32.2%     45.0%
area-damaged      0-3     32.9%     54.4%
severity          0-2     49.2%     63.2%
seed-tmt          0-2     45.6%     51.1%
germination       0-2     32.2%     45.0%
plant-growth      0-1     65.8%     96.4%
leaves            0-1     89.3%     96.4%
leafspots-halo    0-2     49.5%     85.3%
leafspots-marg    0-2     52.2%     86.6%
leafspots-size    0-2     47.8%     87.0%
leaf-shread       0-1     75.9%     81.4%
leaf-malf         0-1     87.3%     88.3%
leaf-mild         0-2     83.7%     88.9%
stem              0-1     54.1%     98.4%
lodging           0-1     80.7%     80.0%
stem-cankers      0-3     58.3%     90.6%
canker-lesion     0-3     49.1%     88.9%
fruiting-bodies   0-1     73.6%     84.3%
external-decay    0-2     75.6%     91.5%
mycelium          0-1     95.8%     96.1%
int-discolor      0-2     86.6%     95.4%
sclerotia         0-1     93.2%     96.1%
fruit-pods        0-3     62.7%     91.2%
fruit-spots       0-4     53.4%     87.0%
seed              0-1     73.9%     85.7%
mold-growth       0-1     80.5%     86.6%
seed-discolor     0-1     79.5%     84.0%
seed-size         0-1     81.8%     88.6%
shriveling        0-1     83.4%     87.9%
roots             0-2     84.7%     95.8%
mean                                81.6%

Table 3: Prediction of all attributes together in the
Soybean data set

We have used the Mutagenesis data set for this experiment
(for each example, there was a fixed probability
that the value of a certain attribute was removed
from the data; this probability was increased for consecutive
experiments), comparing the use of only class
information (logmutag) with the use of three numerical
variables (among which the class) for computing
distances. This experiment is similar in spirit to the
ones performed with COLA [Emde, 1994]. Table 4
shows the results. As expected, performance degrades
less quickly when more information is available, which
supports the claim that the use of more than just class
information can improve performance in the presence
of missing information.

available numerical data   logmutag   all three
100%                       0.80       0.81
50%                        0.78       0.79
25%                        0.72       0.77
10%                        0.67       0.74

Table 4: Classification accuracies obtained for Mutagenesis
with several distance functions, and on several
levels of missing information.
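The way missing information is introduced in this experiment can be sketched as below. This is only an illustration of the described procedure: the dictionary representation of examples, the function name `remove_values`, and the attribute name `other` are assumptions, and how TIC's distance treats a removed value is not specified here.

```python
import random

def remove_values(examples, attribute, p_missing, rng):
    """Return copies of the examples in which `attribute` is removed
    (set to None) independently with probability p_missing."""
    result = []
    for example in examples:
        example = dict(example)          # leave the original untouched
        if rng.random() < p_missing:
            example[attribute] = None
        result.append(example)
    return result

# Hypothetical examples; 50% available data corresponds to p_missing = 0.5.
rng = random.Random(0)
data = [{"logmutag": 1.2, "other": 0.3}, {"logmutag": -0.4, "other": 0.9}]
half = remove_values(data, "logmutag", 0.5, rng)
```

Repeating the call with increasing `p_missing` gives the consecutive levels of missing information compared in Table 4.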

6  CONCLUSIONS  AND  RELATED 
WORK 

We  have  presented  a  novel  first  order  clustering  sys¬ 
tem  TIC  within  the  TDIDT  class  of  algorithms.  TIC 
integrates  ideas  from  concept-learning  (TDIDT),  from 
instance  based  learning  (the  distances  and  the  pro¬ 
totypes),  and  from  inductive  logic  programming  (the 
representations)  to  obtain  a  clustering  system.  Several 
experiments  were  performed  that  illustrate  the  type  of 
tasks  TIC  is  useful  for. 

As far as related work is concerned, our work is related
to KBG [Bisson, 1992], which also performs first
order clustering. In contrast to the current version of
TIC, KBG does use a first order similarity measure,
which could also be used within TIC. Furthermore,
KBG is an agglomerative (bottom-up) clustering algorithm
and TIC a divisive one (top-down). The divisive
nature of TIC makes TIC as efficient as classical
TDIDT algorithms. A final difference with KBG is
that TIC directly obtains logical descriptions of the
clusters through the use of the logical decision tree
format. For KBG, these descriptions have to be derived
in a separate step because the clustering process
only produces the clusters (i.e. sets of examples) and
not their description.

The instance-based learner RIBL [Emde and
Wettschereck, 1996] uses an advanced first
order distance metric that might be a good candidate
for incorporation in TIC.

While [Fisher, 1993] first made the link between
TDIDT and clustering, our work is inspired mainly
by [Langley, 1996]. From this point of view, our work
is closely related to SRT [Kramer, 1996], which builds
regression trees in a supervised manner. TIC can be
considered a generalization of SRT in that TIC can
also build trees in an unsupervised manner, and can
predict multiple values. Finally, we should also refer to
a number of other approaches to first order clustering,
which include Kluster [Kietz and Morik, 1994], [Yoo
and Fisher, 1991], [Thompson and Langley, 1991] and
[Ketterlin et al., 1995].

Future  work  on  TIC  includes  extending  the  system  so 
that  it  can  employ  first  order  distance  measures,  and 
investigating  the  limitations  of  this  approach  (which 
will  require  further  experiments). 
Abstract

When faced with inadequate information,
humans often use knowledge
gained from previous experience
to help them in making decisions.
Even when this knowledge
is spread thinly among many
previous experiences, humans are
able to effectively accumulate and
apply it to a current classification
task of interest. Inspired by
human knowledge reuse, we have
previously introduced a general
framework for the use of knowledge
embodied in existing classifiers
to aid in a new classification
task. In this framework, a supra-classifier
is built to make decisions
based on the outputs of large
numbers of previously trained classifiers
designed for different, but
possibly relevant tasks. In this
article, we discuss the Hamming
Nearest Neighbor (HNN) supra-classifier
architecture and mathematically
show its usefulness. Experiments
on public domain data
sets demonstrate the practicality
of the framework and HNN supra-classifier
when faced with very few
training examples.

Keywords: Knowledge Transfer,
Nearest Neighbor, Curse of Dimensionality

1 INTRODUCTION

In this paper we mathematically analyze the
Hamming Nearest Neighbor (HNN) supra-classifier
architecture for integrating multiple
knowledge sources, based on a recently
introduced framework for knowledge reuse
(Bollacker & Ghosh, 1997). We demonstrate
the ability of the HNN supra-classifier
to theoretically approach optimal performance
even with minimal training samples,
if enough relevant knowledge is available. In
particular, we show that in the limit of having
only one training sample of each target
class but with an infinite number of independent,
(at least weakly) relevant previously
trained classifiers available, a perfect
supra-classifier is approached. We first
review the motivation for and existing research
related to knowledge reuse and summarize
our supra-classifier based knowledge
reuse framework.


A  Supra-Classifier  Architecture  for  Scalable  Knowledge  Reuse  65 


1.1  MOTIVATION 

A  person  is  able  to  quickly  and  robustly 
recognize  patterns  from  very  few  samples. 
This  is  due,  at  least  in  part,  to  the  use  of 
the  vast  reservoir  of  experiential  knowledge 
from  which  he/she  may  draw.  He/she  may 
use  relevant  learned  knowledge  to  better  un¬ 
derstand  the  problem  domain,  thus  help¬ 
ing  to  constrain  the  interpretation  of  cur¬ 
rent  data.  One  of  the  most  impressive  traits 
of  human  knowledge  reuse  is  the  ability  to 
draw  simultaneously  from  a  large  number 
of  previous  experiences  quickly  and  easily. 
Each  bit  of  learned  knowledge  may  not  help 
much,  but  as  a  whole,  the  knowledge  gained 
from  the  whole  of  many  experiences  can 
paint  a  very  clear  picture  of  the  problem 
domain. 

Unlike  humans,  artificial  classification  sys¬ 
tems  often  depend  greatly  on  the  set  of 
training  samples  to  make  classification  de¬ 
cisions.  If  the  training  set  insufficiently  rep¬ 
resents  the  “essence”  of  a  classification  task, 
then  creation  of  a  well  generalizing  classifier 
for  that  task  may  not  be  possible.  It  is  natu¬ 
ral  then,  to  suggest  that  in  the  construction 
of  artificial  classifiers,  the  inclusion  of  pre¬ 
viously  learned  knowledge  embodied  in  pre¬ 
viously  existing  classifiers  is  a  potential  ap¬ 
proach  to  the  problem  of  inadequate  train¬ 
ing  data. 

Also  unlike  humans,  artificial  systems  have 
often  failed  in  their  ability  to  use  a  large 
number  of  weakly  relevant  information 
sources.  For  example,  the  “curse  of  dimen¬ 
sionality”  (e.g.  see  (Friedman,  1994))  is 
given  this  name  (at  least  partially)  because 
of  the  difficulties  it  represents  for  the  cre¬ 
ators  of  well  performing  artificial  classifiers 
when  faced  with  a  high  dimensional  input. 
An ideal architecture for classifier knowledge
reuse would be scalable in the sense
that it can effectively handle the high dimensional
input resulting from use of large
numbers of previously trained classifiers,
even if most of them are only marginally
relevant.

1.2  PREVIOUS  RESEARCH 

The  most  common  approaches  to  knowledge 
reuse  are  ones  that  are  often  not  considered 
to  be  “knowledge  reuse”  per  se,  but  instead 
cast  previously  gained  relevant  knowledge 
as  a  “domain”  which  is  crafted  (often  in 
an  ad  hoc  manner)  to  represent  the  under¬ 
lying  semantic  or  physical  structure  of  the 
problem.  For  example,  Bayesian  approaches 
(e.g.  (Mackay,  1995))  reuse  knowledge  in 
the  form  of  prior  class  probabilities  and 
prior  distributions  assumed  for  the  model 
parameters,  while  many  classifier  architec¬ 
tures  use  the  structure  and  value  of  model 
parameters  to  represent  domain  knowledge 
(e.g.  the  discriminant  function  in  statisti¬ 
cal  classifiers  (Fukunaga,  1990),  size  and  or¬ 
der  of  features  in  decision  trees  (Mitchell, 
1997),  and  the  type  and  number  of  hidden 
units,  amount  and  form  of  regularization 
in  feed-forward  neural  networks  (Ghosh  & 
Turner,  1994)).  Such  approaches  can  work 
very  well  if  the  inductive  bias  matches  the 
problem  very  closely.  However,  in  practice 
it  may  be  quite  difficult  to  select  and  tune  a 
proper  model.  Also,  standard  assumptions 
used  (independence  among  variables,  Gaus¬ 
sian  distributions,  etc.)  to  make  the  prob¬ 
lem  tractable  often  result  in  a  loss  of  accu¬ 
racy  (Heckerman,  1997;  Mackay,  1995). 

There has been much work on knowledge intensive
learning focusing on symbolic rules
extracted from and used in the creation of
neural classifiers (e.g. (Towell & Shavlik,
1994; Mahoney & Mooney, 1993)). If knowledge
can be represented as rules, then it may
be used to build a better classifier. However,
most of the approaches cannot reuse knowledge
from general classifiers and have not
demonstrated scalability to a large number
of simultaneous weak information sources.

Some  recent  work  in  knowledge  reuse  has 
focused  on  the  automated  extraction  and 
reuse  of  knowledge  from  the  data  sets  of 
other  relevant  classifiers,  including  reuse  of 
the  trained  classifiers  themselves.  Under 
the  belief  that  related  classification  tasks 
may  benefit  from  common  internal  features, 
Caruana  (Caruana,  1995)  has  created  a  mul¬ 
tilayer  perceptron  (MLP)  based  multiple 
classifier  system  that  is  trained  simultane¬ 
ously  to  perform  several  related  classifica¬ 
tion  tasks.  Baxter  (Baxter,  1994)  has  de¬ 
veloped  a  rigorous  analysis  of  a  similar  type 
of  architecture,  showing  that  as  the  number 
of  simultaneously  trained  tasks  increases, 
the  number  of  examples  needed  per  task 
for  good  generalization  decreases.  Pratt 
(Pratt,  1994)  has  explored  a  similar  knowl¬ 
edge  reuse  method  in  which  some  of  the 
trained  weights  from  one  MLP  network  are 
used  to  initialize  weights  in  an  MLP  to  be 
trained  for  a  later,  related  task.  A  differ¬ 
ent  approach  is  taken  by  Thrun  (Thrun  & 
O’Sullivan,  1996),  who  proposed  a  method 
to  estimate  classifier  relevance  by  measuring 
how  much  better  a  classifier  performs  with 
a  reused  scaling  vector  for  Nearest  Neighbor 
classifiers.  Tasks  with  mutually  helpful  scal¬ 
ing  vectors  can  be  “clustered”  into  related 
groups. 

Recently popular approaches such as stacking,
committees, ensembles, and mixtures of
experts also use multiple classifiers. However,
since most of these classifiers try to solve
the same task (though they may specialize
in different input regions) and do not use
previously created classifiers, they are simply
good methods of decomposing a classification
task into simpler tasks and do not
generally reuse previous knowledge.


2  KNOWLEDGE  REUSE 
FRAMEWORK 

In our framework for knowledge reuse (Bollacker
& Ghosh, 1997), classifiers previously
trained to perform (potentially relevant)
classification tasks are termed support
classifiers as indicated in Figure 1. Support
classifiers are generally (but not always) designed
for tasks other than the current target
classification task of interest. Our reuse
strategy is to apply the input values of each
of the training samples available for the target
task to all available classifiers sharing
the input domain with the target classifier.
The output class labels of the target
and support classifiers are observed by
a second stage supra-classifier which makes
the ultimate classification ($c_\tau(\cdot)$ in the figure).
Since no internal information is being
used, the support classifiers can be of any
type. All classifiers feeding into the supra-classifier
must share an ultimately common
input domain. This domain may be broad,
such as the domain of all images.

Figure 1: A Supra-Classifier Reuse Architecture.

2.1  A  FEW  DEFINITIONS 

Let the target classification task be $\tau$, and
let $\tau$ have discrete range $S_\tau$ and $d$-dimensional
input domain space $\Re^d$. Let
$\{x, y\}_\tau : x \in \Re^d, y \in S_\tau$ be the set of
training examples for task $\tau$. We assume
that $\{x, y\}_\tau$ is a sample set from the true
distribution for task $\tau$ with associated random
variable $(X_\tau, Y_\tau) \in (\Re^d, S_\tau)$. Our
goal is to find the most likely value of the
conditional marginal $Y_\tau \mid (X_\tau = x)$ and define
this maximum likelihood function to
be $t(x) = \arg\max_y P(Y_\tau = y \mid X_\tau = x)$.
Thus, $t(\cdot) : t(\cdot) \in S_\tau$ is the target function
that we would like to approximate using
the information in $\{x, y\}_\tau$. Let $B$ be
a set of support classification tasks which
have the same input domain space $\Re^d$ as
task $\tau$. Let $\{c_b(\cdot)\} : b \in B$ be the corresponding
set of classifiers where each $c_b(\cdot)$
maps $\Re^d \to S_b : b \in B$.^1 Let $X_\tau$ be the
random variable associated with the input
values of training sample set $\{x, y\}_\tau$. Let
$T_\tau : T_\tau = t(X_\tau)$ be defined as the random
variable associated with the target function
of $X_\tau$. Similarly, let $C_b : C_b = c_b(X_\tau)$ be
the random variables resulting from the application
of $X_\tau$ to the support classifiers.

An Ideal Supra-Classifier $c^*(x)$ will always
choose the most likely class $y \in S_\tau$
given the class labels $\{c_b(x)\} : b \in B$. More
specifically, for any given $\{z_b : z_b \in S_b\} :
b \in B$ we can define the maximum probability
function $m(\cdot)$ as $m(\{z_b\} : b \in B) =
\arg\max_y P(T_\tau = y \mid \{C_b = z_b\} : b \in B)$. We
can then define an ideal classifier based on
this maximum probability function as

$$c^*(x) = m(\{c_b(x)\} : b \in B) \qquad (1)$$

where $c^*(\cdot)$ has associated random variable
$C^* : C^* = c^*(X_\tau)$. In practice, if the number
of support classifiers is quite large, Equation
1 is not directly scalable due to the curse
of dimensionality (Friedman, 1994). Therefore,
approximating approaches to Equation
1 are required. An empirical comparison of
several such approaches was made in (Bollacker
& Ghosh, 1997). Somewhat surprisingly,
for the case of few training examples,
the simple Hamming Nearest Neighbor was
seen to provide the best knowledge reuse
performance. In this paper we provide a
mathematical analysis of the HNN architecture
to better understand its excellent performance
and scalable properties.

^1 Although some of the support classifiers
may have been trained for task $\tau$ directly, in
general $b \neq \tau$ and $S_\tau \neq S_b$, as the tasks are
different.
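To make the scalability problem concrete, the following sketch estimates the maximum probability function of Equation 1 by direct counting over vectors of support classifier labels. It is an illustration only, with hypothetical names (`fit_table`, `ideal_estimate`), and is not one of the approximations compared by the authors.

```python
from collections import Counter, defaultdict

def fit_table(label_vectors, targets):
    """Count, for every observed vector of support labels, how often each
    target label occurred with it."""
    table = defaultdict(Counter)
    for z, y in zip(label_vectors, targets):
        table[tuple(z)][y] += 1
    return table

def ideal_estimate(table, z, default=None):
    """Most frequent target label seen with label vector z, or `default`
    when z never occurred in the training samples."""
    counts = table.get(tuple(z))
    return counts.most_common(1)[0][0] if counts else default

# Hypothetical label vectors from two support classifiers.
vectors = [(0, 1), (0, 1), (0, 1), (1, 1)]
targets = ["a", "a", "b", "b"]
table = fit_table(vectors, targets)
```

With $|B|$ support classifiers of $M$ labels each there are $M^{|B|}$ possible label vectors, so with few training samples almost every vector met at test time is unseen and the table falls back to `default`. This is the curse of dimensionality referred to above.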

3  HAMMING  NEAREST 
NEIGHBOR  (HNN) 
SUPRA-CLASSIFIER 

The  HNN  classifier  is  similar  to  a  tradi¬ 
tional  nearest  neighbor  which  operates  in 
a  Euclidean  space.  The  HNN  operates  in 
a  “Hamming  space”  where  the  distance  be¬ 
tween  two  discrete  values  is  0  if  they  are  the 
same  and  1  if  different.  If  /(•)  is  the  in¬ 
dicator  function,  then  the  (Hamming)  dis¬ 
tance  measure  between  two  samples  xtrain 
and  xtest  can  be  calculated  as 

H^Xtraint  ^test)  — 

For each test sample, the Hamming Nearest Neighbor (HNN) supra-classifier will choose the class label of the training sample with the smallest Hamming distance from it. We now proceed to analyze the HNN classifier to show that under certain assumptions, as more support classifiers are included, the supra-classifier can approach perfect performance, even if the supporting classifiers are not very relevant to the current task. Proofs of the following lemmas are given in (Bollacker & Ghosh, 1998b).
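The decision rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `hamming_distance`, `support_labels`, and `hnn_predict` are hypothetical, and the toy support classifiers over integers stand in for previously trained classifiers.

```python
from typing import Callable, Hashable, Sequence, Tuple

Label = Hashable
Classifier = Callable[[object], Label]


def hamming_distance(u: Sequence[Label], v: Sequence[Label]) -> int:
    """Count the positions at which two label vectors disagree."""
    if len(u) != len(v):
        raise ValueError("label vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def support_labels(x: object, support: Sequence[Classifier]) -> Tuple[Label, ...]:
    """Apply every support classifier to x and collect the output labels."""
    return tuple(c(x) for c in support)


def hnn_predict(x_test, train, support):
    """Return the target label of the training sample whose support-label
    vector is nearest (in Hamming distance) to that of x_test.
    `train` is a list of (sample, target_label) pairs."""
    test_vec = support_labels(x_test, support)
    _, label = min(
        train,
        key=lambda pair: hamming_distance(support_labels(pair[0], support), test_vec),
    )
    return label


if __name__ == "__main__":
    # Hypothetical support classifiers (toy stand-ins, not from the paper).
    support = [lambda x: x > 3, lambda x: x % 2 == 0, lambda x: x > 6]
    train = [(2, "low"), (9, "high")]
    print(hnn_predict(8, train, support))  # high
    print(hnn_predict(1, train, support))  # low
```

Note that the raw features of a sample are never compared directly; only the outputs of the support classifiers enter the distance.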

Definitions and Assumptions

Let $\mathcal{R} \subset \Re^d$ be a region of space in which a distribution of samples $\{x\}$ of non-zero density lie. Let $t(x) \in S_r = \{1, \ldots, N\}$ be the true classifier label of some sample $x \in \mathcal{R}$, where $\{1, \ldots, N\}$ is the finite set of discrete target class labels. Let $c_b(x) \in S_b = \{1, \ldots, M\}$ be a support classifier labeling. Define $P^j = P(t(x) = j)$ and $P^j_i = P(t(x) = j \mid c_b(x) = i)$, where $x$ is a sample chosen randomly from $\mathcal{R}$. $P^j$ can be interpreted as the probability of choosing a sample of target class $j$ when picking randomly from $\mathcal{R}$. $P^j_i$ is the probability of choosing a sample of target class $j$ when picking randomly from the subset of samples in $\mathcal{R}$ which are of support class $i$. It can be seen that if we have two samples $x_\alpha$ and $x_\gamma$, randomly and independently drawn from $\mathcal{R}$, then the probability of them having the same target class label $j$ is $(P^j)^2$.

Lemma  1: 

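If the target classes $j = 1, \ldots, N$ have equal prior probabilities $P^j$, then the probability $\sum_{j=1}^{N} (P^j)^2$ that two such samples share a target class attains its minimum value, $1/N$.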

This  result  is  used  to  show  that  knowing 
that  two  samples  have  the  same  support 
class  label  improves  their  chance  of  being 
of  the  same  target  class. 

Lemma  2: 

If $x_\alpha$ and $x_\gamma$ are drawn randomly and independently from $\mathcal{R}$, and the target classes $j = 1, \ldots, N$ have equal prior probabilities $P^j$, then

$$P\big(t(x_\alpha) = t(x_\gamma) \mid c_b(x_\alpha) = c_b(x_\gamma)\big) \geq P\big(t(x_\alpha) = t(x_\gamma)\big).$$

The proof consists of noticing that the chance of two random samples being the same target class is minimized when all of the $P^j$ are equal (Lemma 1) and that for any partitioning of the set of samples in $\mathcal{R}$ induced by the labels $\{i\} : c_b(\cdot) = i,\ i = 1, \ldots, M$, the probability of two samples being the same target class cannot be reduced further.

Now we can use Lemma 2 to show that two samples randomly and independently chosen from $\mathcal{R}$ have as good or better chance of being of the same support class if they are of the same target class than if they are of different target classes.

Lemma  3: 

If $x_\alpha$, $x_\beta$, and $x_\gamma$ are drawn randomly and independently from $\mathcal{R}$, then

$$P\big(c_b(x_\alpha) = c_b(x_\gamma) \mid t(x_\alpha) = t(x_\gamma)\big) \geq P\big(c_b(x_\beta) = c_b(x_\gamma) \mid t(x_\beta) \neq t(x_\gamma)\big).$$

The case of equality occurs only when $c_b(\cdot)$ is independent of $t(\cdot)$.

This lemma is interesting in the context of a Nearest Neighbor classifier. Let $x_\alpha$ and $x_\beta$ be two training samples and $x_\gamma$ be a test sample.

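Now let us consider the use of $n$ support classifiers to build an HNN supra-classifier. Taking the complements of the events in Lemma 3,

$$P\big(c_b(x_\beta) \neq c_b(x_\gamma) \mid t(x_\beta) \neq t(x_\gamma)\big) \geq P\big(c_b(x_\alpha) \neq c_b(x_\gamma) \mid t(x_\alpha) = t(x_\gamma)\big),$$

and summing over all $b = 1, \ldots, n$, we can write

$$\sum_{b=1}^{n} P\big(c_b(x_\beta) \neq c_b(x_\gamma) \mid t(x_\beta) \neq t(x_\gamma)\big) \geq \sum_{b=1}^{n} P\big(c_b(x_\alpha) \neq c_b(x_\gamma) \mid t(x_\alpha) = t(x_\gamma)\big).$$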

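If we let $\delta_b \geq 0$ be the difference between each pair of terms in the sums, and let $\delta^n = \frac{1}{n}\sum_{b=1}^{n} \delta_b$, we can write

$$\sum_{b=1}^{n} P\big(c_b(x_\beta) \neq c_b(x_\gamma) \mid t(x_\beta) \neq t(x_\gamma)\big) - \sum_{b=1}^{n} P\big(c_b(x_\alpha) \neq c_b(x_\gamma) \mid t(x_\alpha) = t(x_\gamma)\big) = n\delta^n. \tag{2}$$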

Note that $\delta_b = 0$ only if $c_b(\cdot)$ is independent of $t(\cdot)$, and thus is of no use for the target task. We will assume that, as the number of support classifiers grows, the fraction of useful ones does not approach zero (i.e., let $\delta^\infty = \lim_{n \to \infty} \delta^n > 0$).

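Returning to the definition of the HNN classifier, we write the Hamming distances

$$D_n(x_\alpha, x_\gamma) = \sum_{b=1}^{n} I\big(c_b(x_\alpha) \neq c_b(x_\gamma) \mid t(x_\alpha) = t(x_\gamma)\big) \tag{3}$$

$$D_n(x_\beta, x_\gamma) = \sum_{b=1}^{n} I\big(c_b(x_\beta) \neq c_b(x_\gamma) \mid t(x_\beta) \neq t(x_\gamma)\big) \tag{4}$$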

Theorem  1: 

If the support classifiers $c_b(\cdot)$ are independent of each other conditionally on the target class $t(\cdot)$, $t(x_\alpha) = t(x_\gamma) \neq t(x_\beta)$, and the priors ($P^j$) for each target class are equal, then

$$\lim_{n \to \infty} P\big(D_n(x_\beta, x_\gamma) > D_n(x_\alpha, x_\gamma)\big) = 1.$$

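Proof: We will apply the weak law of large numbers as given in (Billingsley, 1979), which states

$$\lim_{n \to \infty} P\Big(\Big|\frac{1}{n}\sum_{i=1}^{n} \big(Y_i - E[Y_i]\big)\Big| \geq \epsilon\Big) = 0$$

for independent trials $Y_i$ and all $\epsilon > 0$. Noticing that $P(c_b(x_1) \neq c_b(x_2)) = E[I(c_b(x_1) \neq c_b(x_2))]$, where $I(\cdot)$ is the indicator function, and substituting from Equations 2, 3, and 4, we can write

$$\lim_{n \to \infty} P\Big(\Big|\frac{1}{n}\big(D_n(x_\beta, x_\gamma) - D_n(x_\alpha, x_\gamma) - n\delta^n\big)\Big| \geq \epsilon\Big) = 0,$$

which leads to

$$\lim_{n \to \infty} P\Big(\Big|\frac{1}{n}\big(D_n(x_\beta, x_\gamma) - D_n(x_\alpha, x_\gamma) - n\delta^n\big)\Big| < \epsilon\Big) = 1$$

$$\Rightarrow \lim_{n \to \infty} P\Big(\frac{1}{n}\big(D_n(x_\beta, x_\gamma) - D_n(x_\alpha, x_\gamma)\big) > \delta^n - \epsilon\Big) = 1.$$

Since we assume $\delta^\infty > 0$, we can choose a sufficiently small $\epsilon < \delta^\infty$ and write

$$\lim_{n \to \infty} P\big(D_n(x_\beta, x_\gamma) > D_n(x_\alpha, x_\gamma)\big) = 1. \qquad \Box$$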

Theorem 1 states that in the limit of an infinite number of conditionally independent and (at least barely) useful support classifiers being available, the probability that the HNN classifier will predict the true target class approaches 1. It should also be noted that Theorem 1 holds even if there is only one training sample of each target class. This result leads to the observation that under certain conditions, a wealth of features can compensate for a dearth of samples. This is counter to the conventional wisdom that more features usually require more training samples. The trick is that we are not working in a Euclidean feature space, and so do not fall victim so easily to the typical curse of dimensionality problems.

Despite this compelling analysis of the HNN classifier, a careful separation of theory from practice should be made. While a perfect classifier is theoretically possible, in general it would be impossible to gather an infinite number of independent, relevant support classifiers. Also, since the cost of the HNN supra-classifier is linear in the number of support classifiers, an infinite number could never be used. However, the results lead us to believe that the more independent support classifiers or training samples are included, the better HNN will perform. This is supported by empirical evidence (Bollacker & Ghosh, 1998a). The question of how fast the HNN classifier approaches perfect performance is only answerable for a specific set of training samples and included support classifiers. An analysis of large deviations as in (Billingsley, 1979) suggests that HNN would approach its limit exponentially fast as a function of the relevance of the support classifiers.
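The limiting behavior in Theorem 1 can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: two equiprobable target classes, and binary support classifiers that are conditionally independent given the target class and agree with it with probability `accuracy` (0.65 in the demo, i.e. only weakly relevant). The function names are hypothetical.

```python
import random


def noisy_vote(target: int, accuracy: float, rng: random.Random) -> int:
    """Binary support classifier output: equals the target class with
    probability `accuracy`, otherwise the other class."""
    return target if rng.random() < accuracy else 1 - target


def closer_rate(n: int, accuracy: float, trials: int, seed: int = 0) -> float:
    """Estimate P(D_n(x_beta, x_gamma) > D_n(x_alpha, x_gamma)), where
    x_alpha and x_gamma share target class 0 and x_beta has class 1."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        d_alpha = d_beta = 0
        for _ in range(n):
            c_alpha = noisy_vote(0, accuracy, rng)
            c_beta = noisy_vote(1, accuracy, rng)
            c_gamma = noisy_vote(0, accuracy, rng)
            d_alpha += c_alpha != c_gamma  # disagreement with same-class sample
            d_beta += c_beta != c_gamma    # disagreement with other-class sample
        wins += d_beta > d_alpha
    return wins / trials


if __name__ == "__main__":
    for n in (5, 50, 1000):
        print(n, closer_rate(n, accuracy=0.65, trials=200))
```

Under these assumptions the estimated probability should rise toward 1 as `n` grows, even though each individual classifier is barely better than chance.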

4  EXPERIMENTS 

We have previously explored the empirical performance of the HNN supra-classifier (Bollacker & Ghosh, 1997), and some of those results are discussed here. We used three public domain data sets from the U.C. Irvine Machine Learning database and partitioned the samples from each data set into two disjoint and unequal sized subsets based on their class labels. The larger subset was used to create several two-class (non-target) problems using all combinations of two classes not to be used as target classes. First, a 20000 sample capital English letter data set (LR) was divided into the target data set consisting of the five classes “H”, “L”, “O”, “R”, and “S”, and 210 other classifier data sets consisting of two-class combinations of the other 21 classes. Second, a spoken vowel data set (VOW) consisted of 990 samples evenly distributed among 11 spoken vowels. The two classes “hud” and “hed” were chosen to form the target classifier task and the remaining 9 classes were used to construct 36 other 2-class classification tasks in a manner similar to the LR data set. Third, the well known soybean data set (SOY) consisted of 683 samples. The three classes “phytophthora-rot”, “brown-spot”, and “alternaria leaf-spot” were chosen to be the target classes, and the remaining 16 classes were used to generate 120 other (2-class) classifiers.

The three data sets were randomly partitioned into equal sized training and test sets. The target training set was used to create MLP and single nearest neighbor (1-NN) classifiers for each target problem. The 210 LR 2-class classifiers were trained MLPs, while the 120 SOY and 36 VOW 2-class classifiers were single nearest neighbor (1-NN) classifiers. These classifier architectures were chosen for their good performance on those tasks. In order to consider the case of few available target training samples, only a fraction of the available target training samples was actually used. The set of support classifiers for each problem consisted of simple classifiers for the target class and all of the 2-class classifiers built using non-target class samples. Target training sets over a range of sizes were applied to the support classifiers for the three problems. The outputs of these support classifiers were then used as the input vector for an HNN supra-classifier. Results using the LR data set (averaged over 20 trials), SOY data set (100 trials), and VOW data set (100 trials) can be seen in Figures 2, 3, and 4 respectively. For all three data sets, the HNN supra-classifier showed improved performance over all of the unaided classifiers, especially with small target training sets. Moreover, the difference between the HNN and unaided 1-NN for few examples was calculated to be statistically significant with greater than 99% confidence.

5  CONCLUSIONS  AND 
FUTURE  WORK 

We have discussed the motivation for reuse of knowledge from previously trained classifiers and presented a framework for such reuse which includes the concept of supra-classifiers. We introduce the Hamming Nearest Neighbor supra-classifier and demonstrate its usefulness both analytically and empirically. This gives evidence that the HNN supra-classifier architecture would be a useful approach to the problem of inadequate training samples.

Figure 2: Test rate vs. number of training samples on the letter recognition data set.

In the future, we intend to do further analysis of the HNN supra-classifier to determine the convergence rate as more support classifiers and target training samples are added. A practical extension will be an application to a truly complex problem domain. We envision the eventual construction of a “warehouse” of previously constructed reusable classifiers for a large domain of interest (e.g. image databases), where the set of support classifiers will serve as an efficient representation of the problem domain knowledge.


Figure 3: Test rate vs. number of training samples on the soybean data set.

Abstract

POMDPs are general models of sequential decisions in which both actions and observations can be probabilistic. Many problems of interest can be formulated as POMDPs, yet the use of POMDPs has been limited by the lack of effective algorithms. Recently this has started to change and a number of problems such as robot navigation and planning are beginning to be formulated and solved as POMDPs. The advantage of the POMDP approach is its clean semantics and its ability to produce principled solutions that integrate physical and information gathering actions. In this paper we pursue this approach in the context of two learning tasks: learning to sort a vector of numbers and learning decision trees from data. Both problems are formulated as POMDPs and solved by a general POMDP algorithm. The main lessons and results are that 1) the use of suitable heuristics and representations allows for the solution of sorting and classification POMDPs of non-trivial sizes, 2) the quality of the resulting solutions is competitive with the best algorithms, and 3) problematic aspects in decision tree learning such as test and misclassification costs, noisy tests, and missing values are naturally accommodated.

1  INTRODUCTION 

POMDPs are general models of sequential decisions in which both actions and observations can be probabilistic (Sondik 1971; Cassandra, Kaebling, & Littman 1994). Many problems of interest can be formulated as POMDPs, yet the use of POMDPs has been limited by the lack of effective algorithms (Cassandra, Kaebling, & Littman 1995). Recently this has started to change and a number of problems such as robot navigation and planning are beginning to be formulated and solved as POMDPs (Cassandra, Kaebling, & Kurien 1996; Geffner & Bonet 1998a). The advantage of the POMDP approach is its clean semantics and its ability to produce principled solutions that integrate physical and information gathering actions. In this paper we pursue this approach in the context of two learning tasks: learning to sort a vector of numbers and learning decision trees from data. Both problems are formulated as POMDPs and solved by a general POMDP algorithm (Geffner & Bonet 1998b) based on the ideas of Real Time Dynamic Programming (Barto, Bradtke, & Singh 1995).

The choice of the two tasks requires an explanation. Both are sequential decision problems that can be naturally seen as POMDPs. Yet the difficulties and insights that result from modeling and solving each problem as a POMDP are different. Sorting involves finding a sequence of comparisons and swaps that would sort any vector of size $n$. This is a challenging planning problem and we are not aware of any contingent planner that can model and solve problems of this type. Modeling and solving the problem from the perspective of POMDPs is challenging too. For $n = 10$, the number of possible states in the problem is greater than $10^6$. Until recently POMDPs with more than 20 states could not be reasonably solved, especially when they involved information-gathering actions. Here we provide solutions for POMDPs of size $n = 10$ that involve more than a million states. Moreover the solutions are good: on average they involve half the number of comparisons and swaps as Quicksort, one of the best sorting algorithms (Aho, Hopcroft, & Ullman 1983). The solution method relies on good heuristic functions, compact representations of beliefs, and suitable decompositions.

The sorting problem is difficult and we use it not to learn about sorting but to learn about POMDPs. The focus on decision tree induction is different, as we expect that the POMDP approach may contribute to a better understanding of decision tree induction on aspects such as noisy data and tests, missing values, and test and misclassification costs. All these aspects fit into the POMDP formulation of decision tree learning in a natural way. We evaluate this formulation over a number of datasets from (Murphy & Aha 1998). Our goal is to show that the POMDP approach may be competitive with the standard approaches and potentially more general. Indeed POMDPs provide a unifying framework for modeling and solving not only sorting and induction, but other AI tasks as well such as robot navigation, planning, control, diagnosis, etc. (Cassandra, Kaebling, & Littman 1994; Geffner & Bonet 1998a). On the other hand, the POMDP algorithms we use do not yet scale up to learning problems over very large datasets.

The rest of the paper is organized as follows. First we review MDPs, POMDPs, and the POMDP algorithm (Sections 2 and 3). Then we formulate the problems of sorting and decision tree induction as POMDPs, and report empirical results (Sections 4 and 5). Finally we summarize the main lessons and ideas (Section 6).

2  BACKGROUND 

POMDPs are a generalization of a model of sequential decision making formulated by Richard Bellman in the 1950s called Markov Decision Processes or MDPs, in which the state of the environment is assumed known (Bellman 1957). MDPs provide the basis for understanding POMDPs so we turn to them first.¹

2.1  MDPs 

The type of MDPs that we consider is a generalization of the standard search model used in AI in which actions can have probabilistic effects. Goal MDPs, as we call them, are characterized by:

1. a state space $S$

2. actions $A(s) \subseteq A$ applicable in each state $s$

3. positive costs $c(a, s)$ of performing action $a$ in $s$

4. transition probabilities $P_a(s'|s)$ of ending up in state $s'$ after doing action $a \in A(s)$ in state $s$

5. goal states $G \subseteq S$

¹For some recent books on MDPs, see (Puterman 1994; Bertsekas & Tsitsiklis 1996); for an AI perspective, see (Boutilier, Dean, & Hanks 1995; Barto, Bradtke, & Singh 1995).

Since the effect of actions is assumed to be observable but not predictable, the solution of an MDP is not an action sequence but a function that maps states $s$ into actions $a \in A(s)$. Such a function is called a policy, and its effect is to assign a probability to each state trajectory. We assume that goal states are absorbing in the sense that actions in those states have no effects and zero costs. As a result, state trajectories that contain goal states have finite costs, while others have infinite costs. The expected cost of a policy from an initial state is the average of the costs of all the state trajectories starting in that state, weighted by their probabilities. A policy is optimal when its expected cost from any state is minimal. General conditions for the existence of such policies can be found in (Puterman 1994; Bertsekas & Tsitsiklis 1996).
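To make the notions of expected cost and optimal policy concrete, here is a minimal sketch of value iteration on a tiny goal MDP. Value iteration is a standard dynamic programming method and is used here only for illustration; it is not the algorithm used later in the paper. The three-state example, its costs, and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

State = str
Action = str
# transitions[s][a] is a list of (next_state, probability) pairs
Transitions = Dict[State, Dict[Action, List[Tuple[State, float]]]]
Costs = Dict[Tuple[Action, State], float]


def q_value(s: State, a: Action, transitions: Transitions, cost: Costs,
            V: Dict[State, float]) -> float:
    """Cost of doing a in s plus the expected cost-to-go of the successor."""
    return cost[(a, s)] + sum(p * V[s2] for s2, p in transitions[s][a])


def value_iteration(transitions: Transitions, cost: Costs, goals: Set[State],
                    tol: float = 1e-9, max_sweeps: int = 10_000) -> Dict[State, float]:
    """Estimate the minimal expected cost to reach a goal from every state."""
    V = {s: 0.0 for s in transitions}  # goal states keep value 0 (absorbing)
    for _ in range(max_sweeps):
        residual = 0.0
        for s in transitions:
            if s in goals:
                continue
            best = min(q_value(s, a, transitions, cost, V) for a in transitions[s])
            residual = max(residual, abs(best - V[s]))
            V[s] = best
        if residual < tol:
            break
    return V


def greedy_policy(transitions, cost, goals, V):
    """Map each non-goal state to an action minimizing the Q-value under V."""
    return {s: min(transitions[s], key=lambda a: q_value(s, a, transitions, cost, V))
            for s in transitions if s not in goals}


if __name__ == "__main__":
    # "go" advances one step with probability 0.5; "jump" is costly but sure.
    transitions = {
        "s0": {"go": [("s1", 0.5), ("s0", 0.5)], "jump": [("goal", 1.0)]},
        "s1": {"go": [("goal", 0.5), ("s1", 0.5)]},
        "goal": {},
    }
    cost = {("go", "s0"): 1.0, ("jump", "s0"): 3.0, ("go", "s1"): 1.0}
    V = value_iteration(transitions, cost, {"goal"})
    print(V, greedy_policy(transitions, cost, {"goal"}, V))
```

In this toy problem the expected cost from `s1` is 2 (a geometric number of unit-cost attempts), so repeatedly trying `go` from `s0` would cost 4 in expectation and the policy prefers the sure `jump` at cost 3.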

3  POMDPs 

POMDPs generalize MDPs by allowing the state to be partially observable (Sondik 1971; Cassandra, Kaebling, & Littman 1994; Russell & Norvig 1994). The solution of a POMDP is no longer a mapping from states into actions, but a mapping from belief states into actions, where belief states are probability distributions over the states. A POMDP agent or controller starts with a prior belief state that adjusts as a result of the actions it performs and the observations it gathers. It is assumed that the agent has a model of both the actions and the sensors. Formally, a goal POMDP is defined in terms of:

1. states $s \in S$

2. actions $A(s) \subseteq A$ applicable in each state $s$

3. positive costs $c(a, s)$ of performing action $a$ in $s$

4. transition probabilities $P_a(s'|s)$ of ending up in state $s'$ after doing action $a \in A(s)$ in state $s$

5. initial belief state $b_0$

6. final belief states $b_F$

7. observations $o$ in state $s$ after action $a$ with probabilities $P_a(o|s)$

The first four components define an MDP that is extended with prior and final beliefs, and a sensor model. POMDPs can be formulated as information or belief MDPs in which states are replaced by belief states (Sondik 1971; Cassandra, Kaebling, & Littman 1994). The task is to find a mapping $\pi$ from belief states to actions that will take us from the initial belief state $b_0$ to a final belief state $b_F$ at a minimum expected cost. The way actions and observations affect the belief state is given by the equations (Cassandra, Kaebling, & Littman 1994):

$$b_a(s) = \sum_{s' \in S} P_a(s|s')\, b(s') \tag{1}$$

$$b_a(o) = \sum_{s \in S} P_a(o|s)\, b_a(s) \tag{2}$$

$$b_a^o(s) = P_a(o|s)\, b_a(s) / b_a(o) \quad \text{if } b_a(o) \neq 0 \tag{3}$$

where $b_a$ is the belief state that results after doing action $a$ in $b$, $b_a(o)$ is the probability of observing $o$ after doing $a$ in $b$, and $b_a^o$ is the belief state that results after doing action $a$ in $b$ and then observing $o$. The cost $c(a, b)$ of an action $a$ in $b$ is the weighted average $\sum_{s \in S} c(a, s)\, b(s)$. The exceptions are the final belief states $b_F$, which are assumed to be absorbing; i.e., $c(a, b_F)$ is defined as 0, and $b_a$ and $b_a^o$ are defined as $b$ when $b$ is a final belief state. Finally, the set of actions $A(b)$ applicable in $b$ excludes the actions $a$ that are not applicable in states $s$ with $b(s) > 0$.
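Equations 1-3 translate almost directly into code. The sketch below assumes a belief is a dictionary from states to probabilities and that the action and sensor models are supplied as functions returning such dictionaries; these interfaces, the function names, and the two-state "listen" example are assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable

State = Hashable
Belief = Dict[State, float]
# trans(a, s) -> {s': P_a(s'|s)};  sensor(a, s) -> {o: P_a(o|s)}
Model = Callable[[str, State], Dict[Hashable, float]]


def progress(belief: Belief, action: str, trans: Model) -> Belief:
    """Equation 1: b_a(s) = sum over s' of P_a(s|s') b(s')."""
    b_a: Belief = {}
    for s_prev, p_prev in belief.items():
        for s_next, p in trans(action, s_prev).items():
            b_a[s_next] = b_a.get(s_next, 0.0) + p * p_prev
    return b_a


def observation_prob(b_a: Belief, action: str, obs, sensor: Model) -> float:
    """Equation 2: b_a(o) = sum over s of P_a(o|s) b_a(s)."""
    return sum(sensor(action, s).get(obs, 0.0) * p for s, p in b_a.items())


def update(belief: Belief, action: str, obs, trans: Model, sensor: Model) -> Belief:
    """Equation 3: b_a^o(s) = P_a(o|s) b_a(s) / b_a(o), defined when b_a(o) != 0."""
    b_a = progress(belief, action, trans)
    p_o = observation_prob(b_a, action, obs, sensor)
    if p_o == 0.0:
        raise ValueError("observation has zero probability; Equation 3 is undefined")
    b_ao: Belief = {}
    for s, p in b_a.items():
        w = sensor(action, s).get(obs, 0.0) * p / p_o
        if w > 0.0:
            b_ao[s] = w
    return b_ao


if __name__ == "__main__":
    # Hypothetical example: "listen" leaves the state unchanged and reports
    # the true side with probability 0.8.
    def trans(a, s):
        return {s: 1.0}

    def sensor(a, s):
        other = "right" if s == "left" else "left"
        return {"hear-" + s: 0.8, "hear-" + other: 0.2}

    prior = {"left": 0.5, "right": 0.5}
    print(update(prior, "listen", "hear-left", trans, sensor))
```

Starting from a uniform prior, hearing "left" once moves the belief to roughly 0.8 for `left` and 0.2 for `right`, as Bayes' rule requires.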

Solving belief MDPs is difficult and until recently only very small problems could be solved reasonably well, especially when they involved information-gathering actions. This has started to change (Cassandra, Kaebling, & Littman 1995) and here we use a POMDP algorithm introduced in (Geffner & Bonet 1998b) that is based on the ideas of Real Time Dynamic Programming (Barto, Bradtke, & Singh 1995).

RTDP-BEL is a hill-climbing algorithm that from any state $b$ searches for the goal states $b_F$ by performing actions $a$ that lead to new states $b_a^o$ with probability $b_a(o)$ (Figure 1). Estimates $V(b)$ of the expected costs to reach $b_F$ guide the search. The main difference with standard hill-climbing is that these estimates are updated dynamically. Initially $V(b)$ is set to $h(b)$, where $h$ is a suitable heuristic function, and every time the state $b$ is visited $V(b)$ is updated to make it consistent with the values $V(b')$ of its possible successor states $b'$ (Korf 1990). In the implementation, the estimates $V(b)$ are stored in a hash table that initially contains an estimate for $V(b_0)$ only. Then when the value $V(b')$ of a state $b'$ that is not in the table is needed, a new entry with $V(b')$ set to $h(b')$ is created. Usually belief states need to be discretized (Geffner & Bonet 1998b) but this is not needed in the tasks considered in this paper.

1. Evaluate each action $a$ applicable in $b$ as

   $$Q(a, b) = c(a, b) + \sum_{o \in O} b_a(o)\, V(b_a^o)$$

   initializing $V(b_a^o)$ to $h(b_a^o)$ when $b_a^o$ is not in the table

2. Apply action $a$ that minimizes $Q(a, b)$, breaking ties randomly

3. Update $V(b)$ to $Q(a, b)$

4. Observe $o$

5. Compute $b_a^o$ using Equations 1-3

6. Exit if $b_a^o$ is a final belief state, else set $b$ to $b_a^o$ and go to 1

Figure 1: RTDP-BEL

RTDP-BEL combines search and simulation, and in every trial selects a random initial state $s$ with probability $b_0(s)$ on which the effects of the actions applied by RTDP-BEL (Step 2) are simulated. More precisely, when action $a$ is chosen, the current state $s$ in the simulation changes to $s'$ with probability $P_a(s'|s)$ and then produces an observation $o$ with probability $P_a(o|s')$. The complete RTDP-BEL algorithm is shown in Fig. 1.
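The six steps of Figure 1 together with the simulation just described can be sketched as follows. This is a simplified reading under several assumptions, not the authors' implementation: beliefs are dictionaries, the same action list is applicable everywhere, actions without a test emit a dummy observation `"none"`, and the demo uses the zero heuristic on a two-element sorting problem. All names are hypothetical.

```python
import random


def freeze(b):
    """Hashable key for a belief state, used to index the value table."""
    return tuple(sorted((s, round(p, 9)) for s, p in b.items() if p > 0))


def sample(dist, rng):
    """Draw one key of `dist` with probability proportional to its value."""
    items = list(dist.items())
    return rng.choices([k for k, _ in items], [w for _, w in items])[0]


def successors(b, a, trans, sensor):
    """Map each observation o with b_a(o) > 0 to the pair (b_a(o), b_a^o)."""
    b_a = {}
    for s, p in b.items():  # Equation 1
        for s2, q in trans(a, s).items():
            b_a[s2] = b_a.get(s2, 0.0) + q * p
    out = {}
    for o in sorted({o for s in b_a for o in sensor(a, s)}):
        p_o = sum(sensor(a, s).get(o, 0.0) * p for s, p in b_a.items())  # Eq. 2
        if p_o > 0:
            b_ao = {s: sensor(a, s).get(o, 0.0) * p / p_o for s, p in b_a.items()}  # Eq. 3
            out[o] = (p_o, {s: p for s, p in b_ao.items() if p > 0})
    return out


def rtdp_bel_trial(b0, is_final, actions, cost, trans, sensor, h, V, rng, max_steps=100):
    """Run one trial from b0, updating the value table V in place."""
    s = sample(b0, rng)  # hidden true state used only for simulation
    b = dict(b0)
    for _ in range(max_steps):
        if is_final(b):
            return
        q, succ = {}, {}
        for a in actions:  # Step 1: evaluate every action
            succ[a] = successors(b, a, trans, sensor)
            q[a] = sum(cost(a, st) * p for st, p in b.items()) + sum(
                p_o * V.setdefault(freeze(b2), 0.0 if is_final(b2) else h(b2))
                for p_o, b2 in succ[a].values())
        best = min(q.values())
        a = rng.choice([x for x in actions if q[x] == best])  # Step 2
        V[freeze(b)] = best                                   # Step 3
        s = sample(trans(a, s), rng)                          # simulate action
        o = sample(sensor(a, s), rng)                         # Step 4
        b = succ[a][o][1]                                     # Steps 5 and 6


if __name__ == "__main__":
    # Two-element sorting: states are the two permutations of (1, 2).
    def trans(a, s):
        return {(s[1], s[0]): 1.0} if a == "swap" else {s: 1.0}

    def sensor(a, s):
        return {("lt" if s[0] < s[1] else "gt"): 1.0} if a == "cmp" else {"none": 1.0}

    b0 = {(1, 2): 0.5, (2, 1): 0.5}
    V, rng = {}, random.Random(0)
    for _ in range(50):
        rtdp_bel_trial(b0, lambda b: len(b) == 1 and (1, 2) in b, ["cmp", "swap"],
                       lambda a, s: 1.0, trans, sensor, lambda b: 0.0, V, rng)
    print(V[freeze(b0)])
```

In the demo the estimate for the initial belief should settle at 1.5: one comparison is always needed, and a swap is needed half of the time.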

4  SORTING 

The sorting problem involves arranging a vector of numbers in increasing order. We simplify the problem slightly by assuming that no two numbers in the vector are equal. There are two types of actions available: $swap(i, j)$, which exchanges the elements in positions $i$ and $j$, and $cmp(i, j)$, which tests whether the element in position $i$ is smaller than the element in position $j$. One of the best algorithms for sorting is Quicksort, which takes on the order of $n \log n$ operations on average, where $n$ is the size of the problem (the number of elements to be sorted).

4.1  FORMULATION 

We formulate the problem as a goal POMDP in which we have to go from an initial belief state to a final belief state by means of a number of tests and swaps. The state $s$ reflects the way in which the elements in the input vector may be ordered; for example, the state $s = [3, 1, 2]$ for $n = 3$ says that the first element in the input vector is the third smallest element, the second element is the smallest element of all, and the third element is the second smallest element. More generally, a state $s$ will be a vector of size $n$ such that $s[i] = j$ for $1 \leq i, j \leq n$ and $s[i] \neq s[j]$ for $i \neq j$. The meaning of $s[i] = j$ is that the $i$-th element in the input vector is the $j$-th smallest element.

Given an input vector, there is a single state that is the true state associated with the input vector and the swaps performed. The actions $cmp(i, j)$ yield information about this state and the actions $swap(i, j)$ mutate it. The resulting ‘sorting’ POMDP for a particular problem size $n$ consists of:

1. states given by the vectors $s$ of size $n$ such that $s[i] = j$ for $0 < i, j \leq n$ and $s[i] \neq s[j]$ if $i \neq j$

2. actions $swap(i, j)$ and $cmp(i, j)$ for $0 < i < j \leq n$

3. transition probabilities $P_a(s'|s) = 1$ if $a = cmp(i, j)$ and $s' = s$, or if $a = swap(i, j)$ and $s'$ is such that $s'[j] = s[i]$, $s'[i] = s[j]$, and $s'[k] = s[k]$ for $k \neq i$, $k \neq j$. Otherwise $P_a(s'|s) = 0$

4. action costs $c(a, s) = 1$ for all $a$ and $s$

5. initial belief state $b_0$ uniform over all states

6. final belief state $b_F$ for which $b_F(G) = 1$, where $s = G$ is the sorted state for which $s[i] = i$ for $i = 1, \ldots, n$

7. observations $o_1 = (i < j)$ or $o_2 = (j < i)$ from the actions $a = cmp(i, j)$ with probabilities $P_a(o_1|s)$ equal to 1 (0) when $s[i] < s[j]$ ($s[i] > s[j]$), and complementary probabilities for $P_a(o_2|s)$.
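The components listed above can be written out explicitly for small $n$. The sketch below enumerates states as rank vectors and implements the deterministic effects of `swap` and `cmp`; the function names and the string encoding of observations are illustrative assumptions.

```python
from itertools import permutations
from math import factorial


def all_states(n):
    """All rank vectors s of size n: s[i] = j means element i is j-th smallest."""
    return list(permutations(range(1, n + 1)))


def swap(s, i, j):
    """Deterministic effect of swap(i, j) on state s (positions are 1-based)."""
    t = list(s)
    t[i - 1], t[j - 1] = t[j - 1], t[i - 1]
    return tuple(t)


def cmp_observation(s, i, j):
    """Noiseless observation produced by cmp(i, j) in state s."""
    return "i<j" if s[i - 1] < s[j - 1] else "j<i"


def is_goal(s):
    """The sorted state G satisfies s[i] = i for every position i."""
    return all(rank == pos for pos, rank in enumerate(s, start=1))


def initial_belief(n):
    """Uniform prior b_0 over all n! states."""
    states = all_states(n)
    return {s: 1.0 / len(states) for s in states}


if __name__ == "__main__":
    s = (3, 1, 2)                    # the example state from the text
    print(cmp_observation(s, 1, 2))  # j<i
    print(swap(s, 1, 2))             # (1, 3, 2)
    print(len(initial_belief(3)))    # 6
    print(factorial(10))             # 3628800
```

Explicit enumeration is only feasible for small $n$; the state count for $n = 10$ already exceeds three million.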

4.2  IMPLEMENTATION 

Finding a policy to take us from $b_0$ to $b_F$ at a nearly optimal expected cost is difficult, and for the RTDP-BEL algorithm to solve this problem for even small values of $n$, suitable belief representations and heuristic functions are needed.

4.2.1  Representation  of  Beliefs 

The beliefs $b(s)$ encode the probability that state $s$ represents the way the elements in the input are ordered. For a sorting problem of size $n$, the size of the state space is $n!$. For $n = 10$, this means more than $10^6$ states. Such large state spaces introduce problems of memory and time in RTDP-BEL and other POMDP algorithms. Memory is a potential problem as in the worst case the size of the hash table grows with the size of the belief space, which is on the order of $2^{n!}$. This problem, however, can be ameliorated by the use of good heuristic functions as discussed below.

The time complexity is more troublesome. The RTDP-BEL loop involves the computation of the belief states $b_a$ and $b_a^o$ from the original belief state $b$ as dictated by Equations 1-3. In the worst case the time for these computations grows with $|S|^2$ and $|S||O|$ respectively. If belief states had few non-zero entries, a suitable sparse representation could be used, but this is not true in sorting, where the initial belief state is uniform.

The representation that we use exploits features of the sorting problem that we expect would arise in other tasks as well.² First of all, since the prior is uniform and the ‘sensors’ (i.e., tests) are noiseless, belief states $b$ can be represented by sets of states $S_b = \{s \mid b(s) > 0\}$. Indeed, from Bayes’ rule it follows that $b(s) = 1/|S_b|$ if $s \in S_b$ and $b(s) = 0$ otherwise. Furthermore, in sorting such sets can be conveniently encoded by a collection of ‘links’ of the form $i \to j$ for $0 < i, j \leq n$, where each link $i \to j$ is a constraint that excludes all states $s$ for which $s[i] > s[j]$. The initial belief state $b_0$ is represented by an empty set of such links, while the representation of $b_a^o$ is obtained from the representation of $b_a$ by adding the link $i \to j$ if $o = (i < j)$, and $j \to i$ if $o = (j < i)$. The representations of $b_a$ and $b$ are equal for $a = cmp(i, j)$, and the first is obtained from the second by exchanging the occurrences of $i$ and $j$ when $a = swap(i, j)$. Our implementation extends this idea with a simple mechanism that removes redundant links after any observation (a link is redundant when it can be inferred by transitivity). The result of this representation is that we reduce the complexity of updating beliefs $b$ into $b_a^o$ from $|S|^2$ to $|O|$, which is significantly smaller.
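The link representation can be sketched with plain sets of pairs. The code below is an illustrative reading of the description above, not the authors' implementation: a link `(i, j)` encodes the constraint $s[i] < s[j]$, redundant links are removed by a transitive reduction, and `consistent_states` enumerates $S_b$ by brute force only to check small cases. All names are hypothetical.

```python
from itertools import permutations


def closure(links):
    """Transitive closure of a set of links (i, j), each meaning s[i] < s[j]."""
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


def reduce_links(links):
    """Drop every link that can be inferred from the others by transitivity."""
    closed = closure(links)
    nodes = {x for link in closed for x in link}
    return {(a, b) for a, b in closed
            if not any((a, k) in closed and (k, b) in closed for k in nodes)}


def observe(links, i, j, i_less_than_j):
    """Belief after cmp(i, j): add i -> j or j -> i, then remove redundancy."""
    new_link = (i, j) if i_less_than_j else (j, i)
    return reduce_links(set(links) | {new_link})


def swap(links, i, j):
    """Belief after swap(i, j): exchange the occurrences of i and j."""
    rename = {i: j, j: i}
    return {(rename.get(a, a), rename.get(b, b)) for a, b in links}


def consistent_states(links, n):
    """Brute-force S_b: the rank vectors satisfying every link (small n only)."""
    return [s for s in permutations(range(1, n + 1))
            if all(s[a - 1] < s[b - 1] for a, b in links)]


if __name__ == "__main__":
    b = observe(set(), 1, 2, True)
    print(b, len(consistent_states(b, 3)))  # {(1, 2)} 3
    b = observe(b, 2, 3, True)
    print(observe(b, 1, 3, True) == b)      # True: 1 -> 3 is redundant
    print(swap({(1, 2)}, 2, 3))             # {(1, 3)}
```

Each update touches only the set of links, never the $n!$ underlying states, which is the point of the representation.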

4.2.2  Updating  the  values  of  belief  states 

The  structures  used  to  represent  belief  states  need  to 
be  converted  into  numbers  for  computing  the  values 

Q{a.,b)  :=  c{a,b) +  '^V{bl)ba{o) 
o€0 

This expression involves a probability $b_a(o)$ that has to be obtained from the representation of $b_a$. One way to compute $b_a(o)$ is by computing the proportion of states $s$ in $b_a$ that satisfy $o$ ($s$ satisfies $(i < j)$ if $s[i] < s[j]$). This operation, however, is very costly as it grows linearly with $|S|$. For this reason we pursue a different approach, approximating $b_a(o)$ for $o = (i < j)$ as:

\[ b_a(o) \approx \begin{cases} 1 & \text{if } i \to j \text{ in } b_a \\ 0 & \text{if } j \to i \text{ in } b_a \\ 1/2 & \text{otherwise} \end{cases} \qquad (4) \]

where $i \to j$ is in $b_a$ when the link forms part of the representation of $b_a$ or can be derived from such links by transitivity. The approximation here is that probabilities that are neither 0 nor 1 are mapped into 1/2. This amounts to assuming that a test $cmp(i,j)$ whose outcome is not predictable can go either way with equal probability. This assumption is not true in general but speeds up the computation and does not appear to do harm, as it is approximately correct for the tests that are optimal. We’ll discuss later a similar approximation in the context of decision tree learning.

*In particular we expect similar ideas to apply to the problem of handling continuous attributes in decision tree learning, but we don’t deal with such problems here.
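To make the approximation concrete, the following Python sketch compares it with the exact proportion on a tiny problem. It assumes links are pairs `(i, j)` meaning $s[i] < s[j]$ with positions numbered from 1; the function names are hypothetical and the exact computation is included only for comparison.

```python
from itertools import permutations

def derivable(links, i, j):
    """True if i -> j is in `links` or follows from them by transitivity."""
    frontier, seen = [i], {i}
    while frontier:
        k = frontier.pop()
        for a, b in links:
            if a == k and b not in seen:
                if b == j:
                    return True
                seen.add(b)
                frontier.append(b)
    return False

def approx_prob(links, i, j):
    """Approximate probability of observing (i < j): 1, 0, or 1/2."""
    if derivable(links, i, j):
        return 1.0
    if derivable(links, j, i):
        return 0.0
    return 0.5

def exact_prob(links, n, i, j):
    """Exact proportion of consistent permutations with s[i] < s[j].
    Enumerates all n! states, so it is usable only for very small n."""
    states = [s for s in permutations(range(1, n + 1))
              if all(s[a - 1] < s[b - 1] for a, b in links)]
    return sum(1 for s in states if s[i - 1] < s[j - 1]) / len(states)
```

For $n = 3$ and the single link $1 \to 2$, the exact probability of observing $(1 < 3)$ is 2/3 while the approximation returns 1/2, which illustrates that the assumption does not hold in general.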

4.2.3  Heuristic  Functions 

The representation of beliefs reduces the complexity of updating beliefs $b$ into $b_a^o$, while the approximation eliminates the cost of computing the probability $b_a(o)$. Both optimizations together considerably speed up the inner loop of the RTDP-BEL algorithm that selects and applies actions. To speed up the solution of problems we also need to consider and apply as few actions as possible. We do this by means of a heuristic function $h(b)$ that provides an estimate of the minimal expected number of actions needed to go from $b$ to the final belief state $b_F$. We consider the combination of two heuristics:

1. the longest chain heuristic $h_l(b)$ is based on the longest sequence of links $i_1 < i_2 < \cdots < i_m$ that appear explicitly in the representation of $b$, with $h_l(b)$ defined as $n - m$

2. the number of misplaced elements heuristic $h_m(b)$ applies to definite belief states only; i.e., those $b$'s such that $b(s) = 1$ for some state $s$. In such a case $h_m(b)$ is defined as the number of positions $i = 1, \ldots, n$ for which $s[i] \ne i$
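Both heuristics are simple to compute. The sketch below is an illustration under assumptions that the text does not fix: links are pairs `(i, j)` over positions 1 to n and are acyclic, $m$ is counted as the number of positions on the longest chain (so a position with no links is a chain of length 1), and the function names are hypothetical.

```python
from functools import lru_cache

def longest_chain(links, n):
    """Number of positions on the longest path i_1 -> ... -> i_m of explicit links."""
    succ = {i: [] for i in range(1, n + 1)}
    for a, b in links:
        succ[a].append(b)

    @lru_cache(maxsize=None)
    def depth(i):
        # Length of the longest chain starting at position i (links assumed acyclic).
        return 1 + max((depth(k) for k in succ[i]), default=0)

    return max(depth(i) for i in range(1, n + 1))

def h_longest_chain(links, n):
    """Longest chain heuristic: n - m."""
    return n - longest_chain(links, n)

def h_misplaced(state):
    """Number of positions i (from 1) with state[i] != i, for a definite belief."""
    return sum(1 for i, v in enumerate(state, start=1) if v != i)
```

Under this counting convention the longest chain heuristic is $n - 1$ for the empty set of links and 0 once a single chain covers all positions.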

These heuristics are not admissible in the sense that they may overestimate the minimum expected cost to the goal, and as a result may prevent the estimates $V(b)$ from approaching the optimal values.^ Yet the admissible heuristics we have tried were not as informative, led the algorithm to visit too many belief states, and in general resulted in memory problems.

^See (Barto, Bradtke, & Singh 1995) for the relation between admissibility and optimality in RTDP algorithms.

[Figure 2 plot; x-axis: trials, 1000 to 10000]

Figure 2: Average number of actions vs Number of Trials for sorting problems of sizes n = 5 and n = 10. Top line is the curve for Quicksort.

A final point about the implementation is that we impose the precondition that the ordering between the elements at positions $i$ and $j$ be known before considering a swap between them. This is done by making an action $swap(i,j)$ applicable in $b$ only when a link $i \to j$ or $j \to i$ is in the representation of $b$. This condition tends to reduce the branching factor of the problem, which is still large as it grows linearly with n.

4.3  EVALUATION 

We tried the above implementation of the RTDP-BEL algorithm on sorting problems of two sizes. Figure 2 shows the performance of the sorting policies computed by RTDP-BEL for problems of size n = 5 and compares them with the ones obtained by Quicksort. The y-axis measures the average number of actions performed and the x-axis the number of trials. For n = 5, there are 5! = 120 states, 20 actions, and 40 observations. The curves for RTDP-BEL correspond to the heuristics $h = 0$, $h = h_l$ and the decomposition method to be explained below. The point at trial $i$, for $i = 1000, 2000, 3000, \ldots, 10000$, indicates the average cost to reach the goal over 1000 simulations using the greedy policy determined by the estimates in the table at trial $i$. RTDP-BEL shows improvement with the heuristics $h = 0$ and $h_l$ but no improvement with the decomposition method. In all cases they arrive at an expected cost that is slightly below 11, which is half the expected cost incurred by Quicksort (which is the top line in the figure). A run of 10000 trials with $h = 0$ takes on the order of 1.36 minutes and leaves 4230 entries in the hash table. The heuristic $h_l$ and the decomposition method are slightly faster.
For larger sizes, neither of the two heuristics $h = 0$ and $h = h_l$ scales up, and only the decomposition method works. We tried this method for n = 10, which generates a POMDP with several million states, 45 actions and 90 observations. The resulting curve is flat with a cost of 37. The average curve for Quicksort is also flat with an average cost of 64. The idea of the decomposition method is the following: the sorting problem is divided into two subproblems by introducing the definite belief states $b'_F$ as subgoals, where the $b'_F$'s are such that $b'_F(s) = 1$ for some $s$. We deal with the problem of going from $b_0$ to some $b'_F$, and from $b'_F$ to $b_F$, separately. That is, each subproblem has its own heuristic function and its own hash table. The second subproblem is triggered after a belief $b'_F$ is obtained, with $b'_F$ as the initial belief state. For the first subproblem, the heuristic $h_l$ is used, while for the second subproblem, $h_m$ is used.

For both n = 5 and n = 10 the resulting curves for the decomposition method are practically flat. This means that the resulting algorithm starts off well but then does not improve. As mentioned above this is the result of the non-admissibility of the heuristics $h_l$ and $h_m$ for each of the two subproblems. We actually ran the same algorithm for both values of n eliminating the update step in RTDP-BEL. The resulting algorithm is a purely greedy algorithm and produced the same results while consuming constant memory (the table with the estimates is not needed). However, even this simplification is not good for very large values of n as the branching factor (the number of actions) grows linearly with n. For such problems other optimizations are needed. An alternative that we have considered is the use of ‘indexicals’ to control the actions that can be considered at any given point. The indexicals in this problem can be just a pair of vector subscripts so that only comparisons and swaps of elements with those subscripts can be considered, in addition to the operations of incrementing and decrementing those indices. Schemes such as these reduce the branching factor of the problem but push the solutions deeper in the search space. Whether and when such a tradeoff speeds up computation remains an open question.

4.4  SUMMARY 

Sorting is a challenging problem that can be effectively modeled and solved as a POMDP provided suitable heuristics, representations and decompositions are used. In this way we have solved a POMDP that involves millions of states and have obtained solutions that compare favorably with Quicksort in terms of the number of steps. The obvious weakness of the resulting sorting policy is that it applies to a particular problem size. An interesting challenge is the extraction of a concise and generalized representation of the policy that could be applied to problems of any size.

5  DECISION  TREES 

Decision trees are classifiers that map instances into classes by sequentially testing the value of a finite set of attributes (Mitchell 1997). The standard way to learn decision trees from data is by a top-down greedy strategy in which the attribute that is most informative for classification according to the data is used to split the data first, and for each possible outcome, the attribute that is most informative according to the remaining data is used second and so on, until either there are no more data or no more uncertainty regarding the classification (Breiman et al. 1984; Quinlan 1993). The generalization power of decision tree algorithms is measured by the classification error over part of the data that is left aside for testing. Decision tree learning algorithms have been applied to a number of domains (Murthy 1998) and a number of variations and extensions have been considered (Dietterich 1997).

5.1  FORMULATION 

The problem of learning decision trees can be seen as a sequential decision problem that involves two types of actions: $report(i)$, by which the current instance $s$ is classified in class $c_i$, and $test(j)$, by which the attribute $t_j$ of $s$ is observed. The goal is to have the instance $s$ classified, and this can be achieved by any of the actions $report(i)$, $i = 1, \ldots, n$, where $n$ is the number of classes. The expected cost associated with such actions depends on the true class of $s$. The actions $test(j)$ provide information about $s$. The ‘classification’ POMDP consists thus of:

1. states $s$ that are the instances in the training set supplemented by a separate goal state $G$

2. actions $report(i)$ for each of the classes $c_i$, and $test(j)$ for each of the attributes $t_j$

3. transition probabilities $P_a(s'|s) = 1$ for $a = test(j)$ and $s' = s$, and for $a = report(i)$ and $s' = G$. Otherwise $P_a(s'|s) = 0$

4. action costs $c(report(i), s) = c_{ij}$ for $class(s) = c_j$ and $c(test(j), s) = c_j$ for all $s$

5. initial belief state $b_0$ uniform over the non-goal states and zero over the goal state

6. final belief state $b_F$ for which $b_F(G) = 1$

7. observations $o$ after action $a = test(j)$ with probabilities $P_a(o|s) = 1$ if $o = v_j(s)$ and 0 otherwise, where $v_j(s)$ stands for the value of $s$ over the attribute $t_j$

The POMDP formulation suggests generalizations of the standard decision tree learning setting such as different test and misclassification costs $c_j$ and $c_{ij}$, noisy tests with $P_a(o|s) \in [0, 1]$, etc. By default we assume here that the cost of tests and correct classifications is 1, while the cost $c_{ij}$ of misclassifications, for $i \ne j$, is some constant $C > 1$.

5.2  IMPLEMENTATION 

We represent belief states as sets of states (training set instances), taking advantage of the uniform prior over the instances and the noiseless ‘sensors’. With this representation, the complexity of a single RTDP-BEL cycle reduces from $|S|^2$ to $|S|$. The value $b_a(o)$ for $a = test(j)$ in Equation 2 is obtained as the proportion of states $s$ in $b$ for which $v_j(s) = o$, a proportion that is computed as $|b_a^o|/|b|$.
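With beliefs stored as sets of training instances, both the belief update for a test and the expected cost of a report reduce to counting. The following Python sketch illustrates this under the default costs stated earlier (1 for a correct classification, a constant C for a misclassification). The data layout, a list of `(attributes, label)` pairs, and the function names are assumptions made for illustration.

```python
from collections import defaultdict

def split(belief, j):
    """Effect of test(j) on a belief: for each observed value o, return the
    probability |b_a^o| / |b| and the successor belief b_a^o."""
    parts = defaultdict(list)
    for attrs, label in belief:
        parts[attrs[j]].append((attrs, label))
    return {o: (len(b_o) / len(belief), b_o) for o, b_o in parts.items()}

def report_cost(belief, cls, C):
    """Expected cost of report(cls) in a uniform belief over instances:
    cost 1 for instances of class cls, cost C for all others."""
    wrong = sum(1 for _, label in belief if label != cls)
    right = len(belief) - wrong
    return (right * 1 + wrong * C) / len(belief)
```

For a four-instance belief with two classes that are separated by attribute 0, reporting immediately costs (2 + 2C)/4 in expectation, while each successor belief after testing attribute 0 can be reported at cost 1.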

We  use  the  non-informative  heuristic  h  =  0.  Heuris¬ 
tics  based  on  measures  such  as  information  gain  (Quin¬ 
lan  1990)  could  be  used  as  well  but  they  only  make  a 
difference  in  the  first  trials  of  RTDP-BEL  as  they  are 
not  calibrated  with  the  expected  classification  costs. 
It  may  be  possible  to  calibrate  such  heuristics  to  ac¬ 
celerate  convergence  but  we  don’t  know  how  to  do  that 
yet. 

5.3  EVALUATION 

Table 1 compares RTDP-BEL with two standard decision tree learning algorithms, ID3 and C4.5 (Quinlan 1990; 1993), over some small datasets obtained from the UCI Repository (Murphy & Aha 1998) for two different misclassification costs C.^ For each dataset, we constructed the corresponding POMDP and ran the RTDP-BEL algorithm with the non-informative heuristic $h = 0$ for 10000 trials. The curves in Figure 3 show the average classification accuracy as a function of the number of trials in the Monk-1 and Monk-2 datasets. A run of 10000 trials over the Monk datasets takes a few minutes on average and leaves a few thousand entries in the hash table. For the larger Votes dataset, the run takes 24 minutes on average and leaves around 16000 entries in the hash table. During testing, whenever a new belief state $b_a^o$ was generated that was not in the hash table, $b_a^o$ was approximated by $b$. This means that unexpected values in the test set are regarded as ‘missing’ values. This is not too different from the approach taken in decision tree learning when test instances get to a node with no compatible branches, and are classified by the distribution of instances in that node.

^The figures for ID3 and C4.5 were taken from (Friedman, Kohavi, & Yun 1996). The column named ‘Test’ in the table indicates how the generalization performance of the algorithms was measured. The Monk-n datasets come with separate training and test data; on the other two problems the test data was generated by 5-fold cross validation: the data were partitioned into five segments, and five runs were performed, each leaving a different segment as test data. The results are the averages over these five runs.


5.3.1  Missing  Values 

In the presence of missing values in the training set, the sum of the beliefs $b_a(o)$ over the real observations $o$ may fail to add up to 1 due to the mass $b_a(m) \ne 0$ over the missing values. In such cases, the beliefs $b_a(o)$ are normalized by dividing them by the sum $\sum_i b_a(o_i)$ taken over the real observations $o_i$. This amounts to assuming that having ‘observed’ a missing value $m$ is like having observed a real observation $o_i$ with probability $b_a(o_i)$. This implies that $b_a^m = b_a$, in agreement with the interpretation of missing values as missing observations. The dataset Votes in Table 1 has missing values.
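The normalization over real observations can be sketched as follows. The marker `MISSING`, the representation of a belief as a list of attribute tuples, and the function name are assumptions made for illustration only.

```python
MISSING = "?"  # hypothetical marker for a missing attribute value

def observation_probs(belief, j):
    """Probabilities b_a(o) for a = test(j), renormalized over the real
    (non-missing) observations so that they add up to 1."""
    counts = {}
    for attrs in belief:
        counts[attrs[j]] = counts.get(attrs[j], 0) + 1
    raw = {o: c / len(belief) for o, c in counts.items()}
    real_mass = sum(p for o, p in raw.items() if o != MISSING)
    if real_mass == 0:
        return {}  # every instance has a missing value for attribute j
    return {o: p / real_mass for o, p in raw.items() if o != MISSING}
```

For a belief with attribute values y, y, n and one missing value, the raw masses 0.5 and 0.25 are rescaled to 2/3 and 1/3.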


5.3.2  Misclassification  Costs  and  Overfitting 

As  expected,  misclassification  costs  have  an  influence 
on the level of overfitting in noisy datasets. Very high misclassification costs induce the algorithm to fit the training data as much as possible, which in those cases may increase the error rate on the test set. This can
be  seen  in  the  last  row  in  Table  1,  where  the  error  rate 
in  the  Votes  data  set  goes  up  by  almost  10  points  when 
the  misclassification  costs  are  increased  from  C  =  25 
to  C  =  10000.  In  general  these  costs  do  not  have  to 
be  all  equal  and  can  be  tuned  to  produce  a  minimal 
error  rate  by  leaving  aside  part  of  the  training  data 
for  that  purpose.  In  other  problems  (e.g.,  medicine), 
these  costs  can  be  chosen  to  approximate  the  real  mis¬ 
classification  costs. 
[Figure 3 plots; panels: Classification Accuracy for Monk-1 Dataset, Classification Accuracy for Monk-2 Dataset]

Figure 3: Classification Accuracy vs. Trials for Monk-1 and Monk-2


Table 1: Accuracy after 10000 trials compared with ID3 and C4.5

Dataset     Feat.  Miss  Train  Test  ID3           C4.5          RTDP (C = 25)  RTDP (C = 10000)
monk-1      6      no    124    432   81.25 ± 1.89  75.70 ± 2.07  97.39 ± 0.29   97.39 ± 0.35
monk-2      6      no    169    432   69.91 ± 2.21  65.00 ± 2.30  64.42 ± 1.13   64.40 ± 0.81
monk-3      6      no    122    432   90.28 ± 1.43  97.20 ± 0.80  95.16 ± 0.49   94.33 ± 0.78
hayes-roth  4      no    160    CV-5  68.75 ± 8.33  74.38 ± 4.24  77.70 ± 4.65   72.04 ± 5.44
votes       16     yes   435    CV-5  93.10 ± 2.73  95.63 ± 0.43  94.42 ± 1.88   83.12 ± 6.75

5.3.3  Approximations 

In another set of experiments we introduced an approximation in the evaluation of the probability $b_a(o)$, which in this case stands for the probability of observing a value $v_j$ after testing an attribute $t_j$ in a given context. The exact value of $b_a(o)$ is given by the number of instances in $b$ whose attribute $t_j$ has value $v_j$ over the total number of instances in $b$. Following a similar approximation used in sorting, we approximated $b_a(o)$ uniformly as $1/n$, where $n$ is the number of values that attribute $t_j$ takes in the training set. As before the intuition was that the best action would be the most informative and would tend to split the data in that way. The results confirmed this intuition and matched almost exactly the ones reported in Table 1. The CPU times were reduced by a factor of three on average. Yet even with this approximation, larger datasets could not be handled as memory tends to explode. The main problem is the lack of an informative heuristic that can guide the search, while leaving a large fraction of the (belief) state space unvisited. Heuristics such as ‘information gain’ (Quinlan 1990) are informative but are not calibrated with the expected costs.® As a result, they produce a focused search for the goal in the first few trials, but then become useless as some of the heuristic values are replaced (updated) by cost estimates. It seems that it should be possible to speed up the convergence of RTDP algorithms by the use of uncalibrated heuristics, but how to do that appears to be an open question.

®That is, information gain is not a good estimate of the expected costs.

5.4  SUMMARY 

We have shown that decision tree induction can be modeled and solved as a POMDP problem and that solutions, while more expensive to compute, may compete in quality with the standard approaches. POMDPs may provide a fresh perspective on the problem of inferring decision trees from data, as aspects such as noisy tests and data, test and misclassification costs, and missing values fit into the POMDP approach in a natural way. The POMDP algorithm used, however, does not scale up yet to large datasets involving many attributes, nor does it apply to datasets involving continuous attributes.

6  CONCLUSIONS 

We aimed to show two things. One is that POMDPs can be used to solve complex problems of sequential decision by the use of suitable heuristics, representations, and decompositions. The second is that POMDPs provide a novel perspective on the problem of inferring decision trees from data that may be worth exploring in further depth. We have been able to solve very large POMDPs with millions of states and obtain solutions that compete in quality with those produced by some of the best algorithms (Quicksort, C4.5). We expect that some of the lessons learned will be applicable to other problems, such as the problem of handling continuous attributes in decision tree learning, which appears to have many aspects in common with sorting. We also think that the POMDP methods used in this paper can be refined so that larger datasets could be handled. A number of interesting questions that may be relevant for the application of POMDP methods to other problems remain open; e.g., how sorting policies can be generalized to arbitrary array sizes, whether misclassification costs can be used effectively to deal with the problem of overfitting, how uncalibrated heuristics can be used to speed up the convergence of RTDP algorithms, etc.
Abstract


Computational  comparison  is  made  between 
two  feature  selection  approaches  for  finding  a 
separating  plane  that  discriminates  between 
two  point  sets  in  an  n-dimensional  feature 
space  that  utilizes  as  few  of  the  n  features 
(dimensions)  as  possible.  In  the  concave  min¬ 
imization  approach  [19,  5]  a  separating  plane 
is  generated  by  minimizing  a  weighted  sum  of 
distances  of  misclassified  points  to  two  par¬ 
allel  planes  that  bound  the  sets  and  which 
determine  the  separating  plane  midway  be¬ 
tween  them.  Furthermore,  the  number  of  di¬ 
mensions  of  the  space  used  to  determine  the 
plane  is  minimized.  In  the  support  vector 
machine  approach  [27,  7,  1,  10,  24,  28],  in 
addition  to  minimizing  the  weighted  sum  of 
distances  of  misclassified  points  to  the  bound¬ 
ing  planes,  we  also  maximize  the  distance  be¬ 
tween  the  two  bounding  planes  that  generate 
the  separating  plane.  Computational  results 
show  that  feature  suppression  is  an  indirect 
consequence  of  the  support  vector  machine 
approach  when  an  appropriate  norm  is  used. 
Numerical  tests  on  6  public  data  sets  show 
that  classifiers  trained  by  the  concave  min¬ 
imization  approach  and  those  trained  by  a 
support  vector  machine  have  comparable  10- 
fold  cross-validation  correctness.  However,  in 
all  data  sets  tested,  the  classifiers  obtained  by 
the  concave  minimization  approach  selected 
fewer  problem  features  than  those  trained  by 
a  support  vector  machine. 


1  INTRODUCTION 

The feature selection problem addressed here is that of discriminating between two finite point sets in n-dimensional feature space $R^n$ by a separating plane that utilizes as few of the features as possible.

Classification  performance  is  determined  by  the  in¬ 
herent  class  information  available  in  the  features  pro¬ 
vided.  It  seems  logical  to  conclude  that  a  large  number 
of  features  would  provide  more  discriminating  ability. 
But,  with  a  finite  training  sample,  a  high-dimensional 
feature  space  is  almost  empty  [12]  and  many  separators 
may  perform  well  on  the  training  data,  but  few  may 
generalize  well.  Hence  the  importance  of  the  feature 
selection  problem  in  classification  [15].  The  optimiza¬ 
tion  formulations  in  Section  2  exploit  one  realization 
of  the  Occam’s  Razor  bias  [3]:  compute  a  separat¬ 
ing  plane  with  a  small  number  of  predictive  features, 
discarding  irrelevant  or  redundant  features.  These  for¬ 
mulations  can  be  considered  wrapper  models  as  defined 
in  [14]. 

The first approach [19, 5], described in Section 2, involves the minimization of a concave function on a polyhedral set. A plane is constructed such that a weighted sum of distances of misclassified points to the plane is minimized and as few dimensions of the original feature space $R^n$ as possible are used. This is achieved by constructing two parallel bounding planes, in as small a dimensional space as possible, that bound each of the two sets to the extent possible by placing the two sets on two opposite halfspaces determined by the two planes. The two planes are determined such that the sum of weighted distances of points in the wrong halfspace to the bounding plane is minimized. This leads to the minimization of a concave function on a polyhedral set (problems (6) and (8) below) for which a stationary point can be obtained by a successive linearization algorithm (Algorithm 2.1 below). The final separating plane is taken midway between the two bounding parallel planes.

The second approach, that of a support vector machine [27, 7, 1, 10, 24, 28], described in Section 3, constructs two parallel bounding planes in n-dimensional space $R^n$ as in the first approach outlined above, but in addition attempts to push these planes as far apart as possible. The justification for this, apart from reducing the VC dimension [27] which in turn improves generalization, is that for the linearly separable case, the further apart the planes, the smaller the halfspace assigned to each of the two sets, reducing the possibility that new unseen points from the wrong set lie in that halfspace. Although improved generalization is the primary purpose of the support vector machine formulation, it turns out that the linear program (13) resulting from employing the $\infty$-norm to measure the distance between the two bounding planes also leads to a feature selection method, whereas the linear program resulting from the use of the 1-norm (12) and the quadratic program resulting from the 2-norm (14) do not lead to feature selection methods.

In  Section  4  we  describe  our  computational  experi¬ 
ments  on  6  publicly  available  data  sets  using  the  ap¬ 
proaches  described  in  Sections  2  and  3.  The  goal 
is  to  evaluate  the  generalization  ability  of  classifiers 
trained  by  solving:  the  concave  optimization  problem 
(8),  three  versions  of  the  support  vector  machine  prob¬ 
lem  with  different  norms  (12),  (13),  (14)  as  well  as  the 
robust  linear  program  RLP  (4).  RLP,  which  underlies 
the  proposed  feature  selection  methods  here,  has  no 
feature  suppression  capability  built  in.  We  measure 
generalization  ability  by  10-fold  cross-validation  [26]. 
Numerical  tests  on  6  public  data  sets  show  that  clas¬ 
sifiers  trained  by  the  concave  minimization  approach 
and  those  trained  by  a  support  vector  machine  have 
comparable  10-fold  cross-validation  correctness.  How¬ 
ever,  in  all  data  sets  tested,  the  classifiers  obtained 
by  the  concave  minimization  approach  selected  fewer 
problem features than those trained by a support vector machine. Further, computational time for the normally used quadratic programming approach for SVMs was orders of magnitude larger than for the proposed linear programming approaches.

We now describe our notation and give some background material. All vectors will be column vectors unless transposed to a row vector by a superscript $T$. For a vector $x$ in $R^n$, $|x|$ will denote a vector in $R^n$ of absolute values of the components of $x$. For a vector $x \in R^n$, $x_+$ denotes the vector in $R^n$ with components $\max\{0, x_i\}$. For a vector $x \in R^n$, $x_*$ denotes the vector in $R^n$ with components $(x_*)_i = 1$ if $x_i > 0$ and 0 otherwise (i.e. $x_*$ is the result of applying the step function component-wise to $x$). The base of the natural logarithm will be denoted by $\varepsilon$, and for a vector $y \in R^m$, $\varepsilon^{-y}$ will denote a vector in $R^m$ with components $\varepsilon^{-y_i}$, $i = 1, \ldots, m$. For $x \in R^n$ and $1 \le p < \infty$:

\[ \|x\|_p = \Big( \sum_{j=1}^{n} |x_j|^p \Big)^{1/p}, \qquad \|x\|_\infty = \max_{1 \le j \le n} |x_j|. \]

For a general norm $\|\cdot\|$ on $R^n$, the dual norm $\|\cdot\|'$ on $R^n$ is defined as

\[ \|x\|' = \max_{\|y\| = 1} x^T y. \]

The 1-norm and $\infty$-norm are dual norms, and so are a p-norm and a q-norm for which $1 \le p, q \le \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$. The notation $A \in R^{m \times n}$ will signify a real $m \times n$ matrix. For such a matrix $A^T$ will denote the transpose of $A$ and $A_i$ will denote the i-th row of $A$. A vector of ones in a real space of arbitrary dimension will be denoted by $e$. A vector of zeros in a real space of arbitrary dimension will be denoted by 0. The notation $\arg\min_{x \in S} f(x)$ will denote the set of minimizers of $f(x)$ on the set $S$. A separating plane, with respect to two given point sets $\mathcal{A}$ and $\mathcal{B}$ in $R^n$, is a plane that attempts to separate $R^n$ into two halfspaces such that each open halfspace contains points mostly of $\mathcal{A}$ or $\mathcal{B}$.
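The notation above can be checked numerically with a short, dependency-free Python sketch. The function names are hypothetical. The brute-force dual norm relies on the standard fact that a linear function attains its maximum over the unit $\infty$-norm ball at a vertex, i.e. at a vector with all components equal to +1 or -1.

```python
import math
from itertools import product

def p_norm(x, p):
    """p-norm of x for 1 <= p < infinity; max-norm for p = math.inf."""
    if math.isinf(p):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def plus(x):
    """x_+ : components max{0, x_i}."""
    return [max(0.0, v) for v in x]

def step(x):
    """x_* : step function applied component-wise."""
    return [1 if v > 0 else 0 for v in x]

def dual_exponent(p):
    """The q with 1/p + 1/q = 1."""
    if p == 1:
        return math.inf
    if math.isinf(p):
        return 1.0
    return p / (p - 1)

def dual_of_inf_norm(x):
    """max of x^T y over ||y||_inf = 1, by enumerating the vertices of the cube."""
    return max(sum(a * b for a, b in zip(x, y))
               for y in product((-1.0, 1.0), repeat=len(x)))
```

For x = (3, -4), the brute-force dual of the $\infty$-norm agrees with the 1-norm of x, as the duality statement requires.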

2  FSV:  FEATURE  SELECTION  VIA 
CONCAVE  MINIMIZATION 

In  this  part  of  the  paper  we  describe  a  feature  selection 
procedure  that  has  been  effective  in  medical  and  other 
applications  [5,  19]. 

Given two point sets $\mathcal{A}$ and $\mathcal{B}$ in $R^n$ represented by the matrices $A \in R^{m \times n}$ and $B \in R^{k \times n}$ respectively, we wish to discriminate between them by a separating plane:

\[ P = \{x \mid x \in R^n,\ x^T w = \gamma\}, \qquad (1) \]

with normal $w \in R^n$ and 1-norm distance to the origin of $|\gamma| / \|w\|_\infty$ [20]. We shall attempt to determine $w$ and $\gamma$ so that the separating plane $P$ defines two open halfspaces $\{x \mid x \in R^n,\ x^T w > \gamma\}$ containing mostly points of $\mathcal{A}$, and $\{x \mid x \in R^n,\ x^T w < \gamma\}$ containing mostly points of $\mathcal{B}$. Hence, upon normalization, we wish to satisfy

\[ Aw \ge e\gamma + e, \qquad Bw \le e\gamma - e, \qquad (2) \]

to  the  extent  possible.  Conditions  (2)  can  be  satisfied 
if  and  only  if,  the  convex  hulls  of  A  and  B  are  disjoint. 
This  is  not  the  case  in  many  real-world  applications. 
Hence,  we  attempt  to  satisfy  (2)  in  some  “best”  sense 
by  minimizing  some  norm  of  the  average  violations  of 
(2)  such  as 


\[ \min_{w,\gamma} f(w,\gamma) = \min_{w,\gamma} \frac{1}{m} \|(-Aw + e\gamma + e)_+\|_1 + \frac{1}{k} \|(Bw - e\gamma + e)_+\|_1. \qquad (3) \]

Recall  that  for  a  vector  x,  x+  denotes  the  vector  with 
components  max{0,Xi}.  Two  principal  reasons  for 
choosing  the  1-norm  in  (3)  are:  (1)  problem  (3)  is 
then  reducible  to  a  linear  program  (4)  with  many  im¬ 
portant  theoretical  properties  making  it  an  effective 
computational  tool  [2],  (2)  the  1-norm  is  less  sensitive 
to  outliers  such  as  those  occurring  when  the  underly¬ 
ing  data  distributions  have  pronounced  tails,  hence  (3) 
has  a  similar  effect  to  that  of  robust  regression  [13],[11, 
pp  82-87]. 

The  formulation  (3)  is  equivalent  to  the  following  ro¬ 
bust  linear  programming  formulation  (RLP)  proposed 
in  [2]  and  effectively  used  to  solve  problems  from  real- 
world  domains  [21]: 

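\[ \begin{array}{ll} \underset{w,\gamma,y,z}{\text{minimize}} & \dfrac{e^T y}{m} + \dfrac{e^T z}{k} \\ \text{subject to} & -Aw + e\gamma + e \le y, \\ & Bw - e\gamma + e \le z, \\ & y \ge 0,\ z \ge 0. \end{array} \qquad (4) \]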

The linear program (4) or, equivalently, the formulation (3), defines a separating plane $P$ that approximately satisfies the conditions (2) in the following sense. Each positive value of $y_i$ determines the distance $y_i / \|w\|'$ [20, Theorem 2.2] between a point $A_i$ of $\mathcal{A}$ lying on the wrong side of the bounding plane $x^T w = \gamma + 1$ for $\mathcal{A}$, that is $A_i$ lying in the open halfspace

\[ \{x \mid x^T w < \gamma + 1\}, \]

and the bounding plane $x^T w = \gamma + 1$. Similarly for $\mathcal{B}$ and $x^T w = \gamma - 1$. Thus the objective function of the linear program (4) minimizes the average sum of distances, weighted by $\|w\|'$, of misclassified points to the bounding planes. The separating plane $P$ (1) is midway between the two bounding planes and parallel to them.

Feature selection [19, 5] is imposed by attempting to
suppress as many components of the normal vector
w to the separating plane P as is consistent with
obtaining an acceptable separation between the sets
A and B. We achieve this by introducing an extra
term with parameter $\lambda \in [0, 1)$ into the objective of
(4) while weighting the original objective by $(1 - \lambda)$ as
follows:

$$\begin{array}{ll}
\underset{w,\gamma,y,z}{\text{minimize}} & (1 - \lambda)\left(\dfrac{e^T y}{m} + \dfrac{e^T z}{k}\right) + \lambda e^T |w|_* \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& y \geq 0,\ z \geq 0.
\end{array} \tag{5}$$

Note that the vector $|w|_* \in R^n$ has components which
are equal to 1 if the corresponding components of w
are nonzero and components equal to zero if the corresponding
components of w are zero. Recall that e
is a vector of ones and $e^T |w|_*$ is simply a count of
the nonzero elements in the vector w. Problem (5)
balances the error in separating the sets A and B,
$\left(\frac{e^T y}{m} + \frac{e^T z}{k}\right)$, and the number of nonzero elements
of w, $(e^T |w|_*)$. Further, if an element of w is zero, the
corresponding feature is removed from the problem.

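By introducing the variable v we are able to eliminate
the absolute value from problem (5) which leads to
the following equivalent parametric program (for $\lambda \in
[0,1)$):

$$\begin{array}{ll}
\underset{w,\gamma,y,z,v}{\text{minimize}} & (1 - \lambda)\left(\dfrac{e^T y}{m} + \dfrac{e^T z}{k}\right) + \lambda e^T v_* \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& y \geq 0,\ z \geq 0, \\
& -v \leq w \leq v.
\end{array} \tag{6}$$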

Since v appears positively weighted in the objective
and is constrained by $-v \leq w \leq v$, it effectively models
the vector $|w|$. This feature selection problem will
be solved for a value of $\lambda \in [0, 1)$ for which the resulting
classification obtained by the separating plane (1),
midway between the bounding planes $x^T w = \gamma \pm 1$,
generalizes best, estimated by a cross-validation tuning
procedure. Typically this will be achieved in a feature
space of reduced dimensionality, that is $e^T v_* < n$
(i.e. the number of features used is less than n).

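Because of the discontinuity of the step function term
$e^T v_*$, we approximate it by a concave exponential on
the nonnegative real line [19]. The approximation of
the step vector $v_*$ of (6) by the concave exponential:

$$v_* \approx t(v,\alpha) = e - \varepsilon^{-\alpha v}, \quad \alpha > 0, \tag{7}$$

(where $\varepsilon$ is the base of natural logarithms, applied componentwise)
leads to the smooth problem (FSV: Feature Selection
Concave):

$$\begin{array}{ll}
\underset{w,\gamma,y,z,v}{\text{minimize}} & (1 - \lambda)\left(\dfrac{e^T y}{m} + \dfrac{e^T z}{k}\right) + \lambda e^T (e - \varepsilon^{-\alpha v}) \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& y \geq 0,\ z \geq 0, \\
& -v \leq w \leq v.
\end{array} \tag{8}$$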
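
To make the approximation (7) concrete, the sketch below compares the exact step count $e^T v_*$ with the concave exponential surrogate for two values of $\alpha$. The function names and the sample vector are illustrative assumptions, and `math.exp` plays the role of $\varepsilon$ raised to a power.

```python
import math


def step_count(v):
    """e^T v_*: number of nonzero components of v (the step term in (6))."""
    return sum(1 for vi in v if vi != 0)


def concave_exp(v, alpha=5.0):
    """e^T (e - exp(-alpha * v)): the smooth surrogate of (8), for v >= 0."""
    return sum(1.0 - math.exp(-alpha * vi) for vi in v)


v = [0.0, 0.5, 2.0]                     # hypothetical nonnegative vector
exact = step_count(v)                   # two nonzero components
approx_5 = concave_exp(v, alpha=5.0)    # alpha = 5, the value used in the text
approx_50 = concave_exp(v, alpha=50.0)  # a larger alpha tightens the fit
```

The surrogate never exceeds the true count on the nonnegative orthant and approaches it from below as $\alpha$ grows, which is the sense in which (8) smooths (6).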

It can be shown [4, Theorem 2.1] that for a finite
value of $\alpha$ (appearing in the concave exponential) the
smooth problem (8) generates an exact solution of the
nonsmooth problem (6). We note that this problem is
the minimization of a concave objective function over
a polyhedral set. Even though it is difficult to find a
global solution to this problem, a fast successive linear
approximation (SLA) algorithm [5, Algorithm 2.1] terminates
finitely (usually in 5 to 7 steps) at a stationary
point which satisfies the minimum principle necessary
optimality condition for problem (8) [5, Theorem 2.2]
and leads to a sparse w with good generalization properties.
For convenience we state the SLA algorithm
below.

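Algorithm 2.1

Successive Linearization Algorithm (SLA) for
FSV (8). Choose $\lambda \in [0,1)$. Start with a random
$(w^0, \gamma^0, y^0, z^0, v^0)$. Having $(w^i, \gamma^i, y^i, z^i, v^i)$ determine
$(w^{i+1}, \gamma^{i+1}, y^{i+1}, z^{i+1}, v^{i+1})$ by solving the linear
program:

$$\begin{array}{ll}
\underset{w,\gamma,y,z,v}{\text{minimize}} & (1 - \lambda)\left(\dfrac{e^T y}{m} + \dfrac{e^T z}{k}\right) + \lambda\alpha\,(\varepsilon^{-\alpha v^i})^T (v - v^i) \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& y \geq 0,\ z \geq 0, \\
& -v \leq w \leq v.
\end{array} \tag{9}$$

Stop when

$$(1 - \lambda)\left(\frac{e^T (y^{i+1} - y^i)}{m} + \frac{e^T (z^{i+1} - z^i)}{k}\right) + \lambda\alpha\,(\varepsilon^{-\alpha v^i})^T (v^{i+1} - v^i) = 0. \tag{10}$$

Comment: The parameter $\alpha$ was set to 5. The parameter
$\lambda$ was chosen to “maximize” generalization
performance.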

We have found useful solutions to (8) for the fixed
value $\alpha = 5$ [5, 4]. Another approach, involving more
computation, is to solve (8) for an increasing sequence
of $\alpha$ values.
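
The linearization behind each SLA step can be sketched in a few lines. The gradient of the concave term $\lambda e^T(e - \varepsilon^{-\alpha v})$ with respect to v is $\lambda\alpha\,\varepsilon^{-\alpha v}$, so at iterate $v^i$ the linear program weights v by that vector. The code below only computes these cost coefficients and the stopping quantity; it does not solve the linear program, and all function names and numbers are assumptions for illustration.

```python
import math


def v_cost_coefficients(v_i, lam, alpha=5.0):
    """Cost vector multiplying v in the linearized program:
    lam * alpha * exp(-alpha * v_i), the gradient of the concave term at v_i."""
    return [lam * alpha * math.exp(-alpha * vj) for vj in v_i]


def stopping_quantity(y_i, z_i, v_i, y_next, z_next, v_next, lam, alpha=5.0):
    """Change in the linearized objective between consecutive iterates;
    the algorithm stops when this quantity is zero."""
    m, k = len(y_i), len(z_i)
    err = (sum(a - b for a, b in zip(y_next, y_i)) / m
           + sum(a - b for a, b in zip(z_next, z_i)) / k)
    coeff = v_cost_coefficients(v_i, lam, alpha)
    lin = sum(c * (a - b) for c, a, b in zip(coeff, v_next, v_i))
    return (1.0 - lam) * err + lin


# Hypothetical iterate: components of v near zero receive the largest cost,
# which is what pushes them to exactly zero in the next linear program.
coeff = v_cost_coefficients([0.0, 1.0], lam=0.5)
unchanged = stopping_quantity([0.2], [0.1], [0.0, 1.0],
                              [0.2], [0.1], [0.0, 1.0], lam=0.5)
```

In practice the coefficients would be handed to a linear programming solver together with the constraints; that step is omitted here because it depends on the solver chosen.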

3  SVM:  FEATURE  SELECTION 
VIA  SUPPORT  VECTOR 
MACHINES 

The support vector machine idea [27, 1, 10, 24, 28],
although not originally intended as a feature selection
tool, does in fact indirectly suppress components of the
normal vector w to the separating plane P (1) when
an appropriate norm is used for measuring the distance
between the two parallel bounding planes for the
sets being separated. The SVM approach consists of
adding another term, $\frac{\lambda}{2}\|w\|'$, to the objective function of
the RLP (4) in a similar manner to the appended term
$e^T |w|_*$ of problem (5). Here, $\|\cdot\|'$ is the dual of some
norm on $R^n$ used to measure the distance between the
two bounding planes. The justification for this term
is as follows. The separating plane P (1) generated by
the RLP linear program (4) lies midway between the
two parallel planes $w^T x = \gamma + 1$ and $w^T x = \gamma - 1$.
The distance, measured by some norm $\|\cdot\|$ on $R^n$,
between these planes is precisely $\frac{2}{\|w\|'}$ [20, Theorem
2.2]. The appended term to the objective function of
the RLP (4), $\frac{\|w\|'}{2}$, is the reciprocal of this distance,
thus driving the distance between these two planes up
to obtain better separation. This then results in the
following mathematical programming formulation for
the SVM:

$$\begin{array}{ll}
\underset{w,\gamma,y,z}{\text{minimize}} & (1 - \lambda)(e^T y + e^T z) + \dfrac{\lambda}{2}\|w\|' \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& y \geq 0,\ z \geq 0.
\end{array} \tag{11}$$

Points $A_i \in A$ and $B_i \in B$ appearing in active constraints
of the linear program (11) with positive dual
variables constitute the support vectors of the problem.
These points are the only data points that are
relevant for determining the optimal separating plane.
Their number is usually small and it is proportional to
the generalization error of the classifier [24].

If we use the 1-norm to measure the distance between
the planes, then the dual to this norm is the $\infty$-norm
and accordingly $\|w\|' = \|w\|_\infty$ in (11) which leads to
the following linear programming formulation:

$$\begin{array}{ll}
\underset{w,\gamma,y,z,\nu}{\text{minimize}} & (1 - \lambda)(e^T y + e^T z) + \lambda\nu \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& -e\nu \leq w \leq e\nu, \\
& y \geq 0,\ z \geq 0.
\end{array} \tag{12}$$


Similarly if we use the $\infty$-norm to measure the distance
between the planes, then the dual to this norm is the 1-norm
and accordingly $\|w\|' = \|w\|_1$ in (11) which leads
to the following linear programming formulation:

$$\begin{array}{ll}
\underset{w,\gamma,y,z,s}{\text{minimize}} & (1 - \lambda)(e^T y + e^T z) + \lambda e^T s \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& -s \leq w \leq s, \\
& y \geq 0,\ z \geq 0.
\end{array} \tag{13}$$

We note that the first paper on the multisurface
method of pattern separation [17] also proposed and
implemented, just as does the support vector machine
approach, forcing the two parallel planes that bound
the sets to be separated to be as far apart as possible.

Usually the support vector machine problem is formulated
using the 2-norm in the objective [27, 1]. Since
the 2-norm is dual to itself, it follows that the distance
between the parallel planes defining the separating
surface is also measured in the 2-norm when this
formulation is used. In this case $\|w\|' = \|w\|_2$, and
one usually appends the term $\frac{\lambda}{2}\|w\|_2^2$ to the objective
of (11) resulting in the following quadratic program:

$$\begin{array}{ll}
\underset{w,\gamma,y,z}{\text{minimize}} & (1 - \lambda)(e^T y + e^T z) + \dfrac{\lambda}{2} w^T w \\
\text{subject to} & -Aw + e\gamma + e \leq y, \\
& Bw - e\gamma + e \leq z, \\
& y \geq 0,\ z \geq 0.
\end{array} \tag{14}$$
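
The role of the dual norm in (11) through (14) can be checked numerically. The sketch below, with hypothetical function names and a made-up normal vector, returns the distance $2/\|w\|'$ between the two bounding planes for each choice of the norm in which distance is measured.

```python
def dual_norm(w, norm):
    """Dual norm ||w||' for the norm used to measure distance:
    the 1-norm and infinity-norm are dual to each other, the 2-norm is self-dual."""
    if norm == "1":
        return max(abs(wi) for wi in w)          # dual of the 1-norm
    if norm == "inf":
        return sum(abs(wi) for wi in w)          # dual of the infinity-norm
    if norm == "2":
        return sum(wi * wi for wi in w) ** 0.5   # 2-norm is its own dual
    raise ValueError("norm must be '1', '2' or 'inf'")


def plane_distance(w, norm):
    """Distance 2 / ||w||' between the planes w'x = gamma + 1 and w'x = gamma - 1."""
    return 2.0 / dual_norm(w, norm)


w = [3.0, 4.0]                       # hypothetical normal vector
d1 = plane_distance(w, "1")          # uses ||w||_inf = 4
d2 = plane_distance(w, "2")          # uses ||w||_2 = 5
dinf = plane_distance(w, "inf")      # uses ||w||_1 = 7
```

Measuring distance in the $\infty$-norm therefore places the 1-norm of w in the objective, as in (13), and it is this 1-norm term that tends to drive individual components of w to zero.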

Nonlinear separating surfaces, which are linear in their
parameters, can also easily be handled by the formulations
(8), (12) and (13) [16]. If the data are mapped
nonlinearly via $\Phi : R^n \to R^\ell$, a nonlinear separating
surface in $R^n$ is easily computed as a linear separator
in $R^\ell$. In practice, one usually solves (14) by way of its
dual [18]. In this formulation, the data enter only as
inner products which are computed in the transformed
space via a kernel function $K(x,y) = \Phi(x) \cdot \Phi(y)$
[6, 27, 28].
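
As a minimal worked instance of the kernel identity $K(x,y) = \Phi(x) \cdot \Phi(y)$, the sketch below uses the homogeneous quadratic kernel on $R^2$. This particular map and kernel are chosen here for illustration and are not taken from the paper.

```python
import math


def phi(x):
    """Explicit quadratic feature map R^2 -> R^3 (an illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def kernel(x, y):
    """Homogeneous quadratic kernel K(x, y) = (x . y)^2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


x, y = [1.0, 2.0], [3.0, -1.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(y)))  # inner product in R^3
implicit = kernel(x, y)                                # same value, computed in R^2
```

The two values agree up to rounding, so a linear separator in the mapped space can be computed without ever forming $\Phi(x)$ explicitly.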

We note that separation errors in (12) - (14) are
weighted equally conforming to the SVM formulations
in [6, 27]. In contrast, the formulations (4) and (8)
measure average separation error. Minimizing average
separation error in (4) ensures that the solution w = 0
occurs iff $\frac{e^T A}{m} = \frac{e^T B}{k}$, in which case it is not unique
[2, Theorem 2.5].

We  turn  our  attention  now  to  computational  testing 
and  comparison. 


4  COMPUTATIONAL  RESULTS 

4.1  DATA  SETS 

The Wisconsin Prognostic Breast Cancer Database
consists of 198 instances with 35 features, each representing
follow-up data for one breast cancer case [23].

We used 2 variants of this data set. In the first variant,
the elements of the set A were 30
nuclear features plus diameter of excised tumor and
number of positive lymph nodes of instances corresponding
to patients in which cancer had recurred in
less than 24 months (28 points). The set B consisted
of the same features for patients in which cancer had
not recurred in less than 24 months (127 points). The
second variant of the data set consisted of the same 32
features, but splits the data into A and B differently.
Elements of A correspond to patients with a
cancer recurrence in less than 60 months (41 points)
and B corresponds to patients in which cancer had not
recurred in less than 60 months (69 points).

The Johns Hopkins University Ionosphere data set
consists of 34 continuous features of 351 instances [23].
Each instance represents a radar return from the ionosphere.
The set A consists of 225 radar returns termed
“good” or showing some type of structure in the ionosphere.
The set B consists of 126 radar returns termed
“bad”; their signals pass through the ionosphere.

The Cleveland Heart Disease data set consists of 297
instances with 13 features (see documentation [23]). Set
A consists of 214 instances. The set B consists of 83
instances.

[Figure: Correctness vs. $\lambda$ (“Tune”, “Test”)]

Figure 1: Tuning and testing sets correctness for a support
vector machine (13) versus the sparsity-inducing parameter
$\lambda$ on the WPBC (24 months) data set. Dashed = “tuning”
correctness, Solid = test correctness.

The Pima Indians Diabetes data set consists of 768
instances with 8 features plus a class label (see documentation
[23]). The 500 instances with class label
“0” were placed in A, the 268 instances with class label
“1” were placed in B.

The  BUPA  Liver  Disorders  data  set  consists  of  345 
instances  with  6  features  plus  a  selector  field  used  to 
split  the  data  into  2  sets  (see  documentation  [23]).  Set 
A  consists  of  145  instances,  the  set  B  consists  of  200 
instances. 

4.2  EXPERIMENTAL  METHODOLOGY 

Our goal was to evaluate the generalization ability of
the classifiers obtained by solving: the concave minimization
problem FSV (8), the SVM 1-norm problem (13),
the SVM $\infty$-norm problem (12), the SVM 2-norm
problem (14) and the robust linear program (RLP)
(4). We estimate the generalization ability of a classifier
via 10-fold cross-validation [26].

We note that the objective function parameter $\lambda$,
which can induce sparsity, must be chosen carefully
to maximize the generalization ability of the resulting
classifier. Choosing $\lambda = 0$ will maximize the training
correctness of the resulting classifier, but often this
classifier performs poorly on data not in the training
set [25]. We employ the following “tuning set”
procedure for choosing $\lambda$ at each fold of 10-fold cross-validation:
For each $\lambda$ in a candidate set $\Lambda$, we perform
the following: (i) set aside 10% of the training data as
a “tuning” set, (ii) obtain a classifier for the given
value of $\lambda$, (iii) determine correctness on the “tuning”
set, (iv) repeat steps (i)-(iii) ten times, each time setting
aside a different 10% portion of the training data.
The “score” for this value of $\lambda$ is the average of the 10
correctness values determined in (iii).

We fix the value of $\lambda$ as that with the best “score” determined
from the tuning procedure (ties are broken by
choosing the smallest $\lambda$-value). This is the value used
for the given fold of 10-fold cross-validation. The set $\Lambda$
is a set of candidate values and for these experiments
was set at: $\Lambda = \{0.05, 0.10, 0.20, \ldots, 0.90, 0.95\}$. The
curves in Figure 1 indicate that the value of $\lambda$ that
maximizes the “tuning” score (dashed curve in Figure
1) is a good estimate of the value of $\lambda$ that maximizes
the test set correctness (solid curve).
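
The tuning-set procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `fit` and `correctness` are hypothetical callables standing in for "solve the chosen formulation for this parameter value" and "fraction of tuning points classified correctly", and the 10% slices are taken by striding rather than at random.

```python
def tune_lambda(train, candidates, fit, correctness, folds=10):
    """Pick lambda by the tuning-set procedure: for each candidate, hold out a
    different ~10% slice of the training data `folds` times, score the fitted
    classifier on it, and average. Ties go to the smallest lambda."""
    best_lam, best_score = None, float("-inf")
    for lam in sorted(candidates):
        scores = []
        for f in range(folds):
            tuning = train[f::folds]                          # held-out slice
            rest = [p for i, p in enumerate(train) if i % folds != f]
            scores.append(correctness(fit(rest, lam), tuning))
        score = sum(scores) / folds
        if score > best_score:       # strict test keeps the smallest lambda on ties
            best_lam, best_score = lam, score
    return best_lam, best_score


# Stand-ins for illustration only: the "model" is just lambda itself, and the
# score peaks at lambda = 0.3 regardless of the data.
def toy_fit(data, lam):
    return lam


def toy_correctness(model, tuning):
    return 1.0 - abs(model - 0.3)


candidates = [0.05] + [i / 10 for i in range(1, 10)] + [0.95]
best_lam, best_score = tune_lambda(list(range(50)), candidates,
                                   toy_fit, toy_correctness)
```

In the experiments this selection would be repeated inside every fold of the outer 10-fold cross-validation, using only that fold's training data.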

4.3  EXPERIMENTAL  RESULTS 

Table  1  summarizes  the  average  number  of  original 
problem  features  selected  by  the  classifiers  trained  by 
each  of  the  methods. 

Table 2 summarizes the results of the 10-fold cross-validation
experiments on 6 real-world data sets. All
“Train” and “Test” numbers presented are average
correctnesses over the 10 folds. The p-value is an indicator
of a significant difference in “Test” correctness between
the classifiers obtained by solving FSV (8) and the
classifiers obtained by solving the SVM 1-norm problem
(13)¹. Recall that a high p-value indicates that
the difference is not significant. We note that p-values
were not calculated for the other pairwise comparisons
because the solutions obtained by solving the SVM
$\infty$-norm, SVM 2-norm and the RLP did not suppress
problem features (see Table 1).

4.4  DISCUSSION 

The FSV (8) and the SVM 1-norm (13) problems
were the only ones exhibiting feature selection (Table 1).
On the 6 data sets tested, the SVM 1-norm
classifiers performed slightly better on 3 data sets and
FSV classifiers performed slightly better on 3 data sets.
The minimum p-value, 0.1246, indicates that classifiers
obtained by the FSV (8) and the SVM 1-norm
(13) methods have similar generalization properties.
Applying the paired t-test to 10-fold cross validation
results may indicate a difference in the average test
set correctness when one is not present [9]. Thus the
results of these experiments may be more similar than
indicated by the p-values.

¹Specifically, this is the p-value of a two-tailed paired
t-test testing the hypothesis that the difference in “Test”
correctnesses for the FSV and SVM 1-norm classifiers is
zero.
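
For reference, the statistic behind such a paired t-test can be computed as in the sketch below. The per-fold correctness values are invented for illustration and are not the paper's data; converting the statistic to a p-value requires the t distribution with n - 1 degrees of freedom, which is left to a statistics library.

```python
import math


def paired_t_statistic(acc_a, acc_b):
    """t statistic for the paired differences of per-fold correctness values.
    Under the null hypothesis (mean difference zero) it follows a t
    distribution with len(acc_a) - 1 degrees of freedom."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)


# Hypothetical per-fold test correctness for two classifiers (not paper data).
fsv = [0.70, 0.75, 0.80, 0.65, 0.70, 0.75, 0.80, 0.70, 0.75, 0.70]
svm = [0.72, 0.73, 0.78, 0.67, 0.72, 0.73, 0.82, 0.68, 0.77, 0.70]
t = paired_t_statistic(fsv, svm)
```

With ten folds the two-tailed 5% critical value is about 2.262, so a statistic of this small magnitude would correspond to a high p-value, that is, no significant difference.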

We note that the classifiers obtained by solving the
SVM $\infty$-norm (12) suppressed none of the original
problem features for all but the largest values of $\lambda$
(near 1.0), which in general is of little use because it
is often accompanied by poor set separation. Similar
behavior was observed by solving the SVM 2-norm
(14) problem. Note that the $\infty$-norm is sensitive to
outliers, as is the 2-norm squared.

The classifiers obtained by solving the FSV problem
(8) selected fewer problem features than any of the
SVM formulations (12), (13), (14) and the RLP (4).
FSV classifiers reduced the number of features used
over SVM 1-norm by as much as 39.5% (WPBC 60
month), while maintaining comparable generalization
performance.

On the WPBC 24 month dataset, both the FSV classifiers
(8) and the SVM 1-norm classifiers (13) most
often selected a nuclear area feature and number of
lymph nodes removed from the patient. These features
are deemed relevant to the prognosis problem.

All linear programming formulations were solved using the
CPLEX package [8] called from within MATLAB [22].
The quadratic programming problem (14) was solved
using MATLAB's quadratic optimization solver, which
encountered difficulty with the conditioning of the QP constraint
matrix; this may affect the interpretation of
the results for this approach. See Table 3 for average
solve times.

5  SUMMARY  AND  FUTURE 
WORK 

Computational comparisons of classifiers obtained by
solving four mathematical optimization problems are
presented. The optimization formulations are either
linear (4), (12) and (13), or quadratic (14), or can be
solved by a finite sequence of linear programs (solving
(8) via Algorithm 2.1). Classifiers obtained
by solving the FSV problem (8) and the SVM
1-norm problem (13) exhibit feature suppression
and have comparable generalization performance
on six publicly available real world
data sets tested. The classifiers obtained by
solving the FSV problem (8) suppressed more
features than the corresponding SVM 1-norm
classifiers (13). The quadratic SVM (14) took
orders of magnitude more time than the linear-programming-based
SVMs (12) and (13).

When the distance between the 2 parallel planes defining
the separating surface in the SVM problem is measured
in the 1-norm, the resulting SVM optimization
problem has the $\infty$-norm (dual norm to the 1-norm)
appearing in the objective. The classifiers obtained by
solving this problem (SVM $\infty$-norm (12)) did not exhibit
feature selection. Similar behavior was observed
for classifiers obtained by solving the SVM 2-norm (14)
problem. The generalization ability of those classifiers
in comparison with the others presented needs to be
further investigated.

Future work includes further analysis of the benefits
of measuring the distance between the bounding parallel
planes defining the separating plane and the resulting
optimization problem utilizing the dual norm
(11). A characterization of classes of data sets which
lend themselves to better separation with the choice
of one norm over another will allow practitioners to
choose a priori an optimization formulation believed
to be “best” suited to the separation problem at hand.

Abstract

Practical  approaches  to  clustering  use  an  iterative 
procedure  (e.g.  K-Means,  EM)  which  converges 
to  one  of  numerous  local  minima.  It  is  known 
that  these  iterative  techniques  are  especially 
sensitive  to  initial  starting  conditions.  We 
present  a  procedure  for  computing  a  refined 
starting  condition  from  a  given  initial  one  that  is 
based  on  an  efficient  technique  for  estimating  the 
modes  of  a  distribution.  The  refined  initial 
starting  condition  allows  the  iterative  algorithm 
to  converge  to  a  “better”  local  minimum.  The 
procedure  is  applicable  to  a  wide  class  of 
clustering  algorithms  for  both  discrete  and 
continuous  data.  We  demonstrate  the  application 
of  this  method  to  the  popular  K-Means  clustering 
algorithm  and  show  that  refined  initial  starting 
points  indeed  lead  to  improved  solutions. 
Refinement  run  time  is  considerably  lower  than 
the  time  required  to  cluster  the  full  database. 

The  method  is  scalable  and  can  be  coupled  with 
a  scalable  clustering  algorithm  to  address  the 
large-scale  clustering  problems  in  data  mining. 

1.  BACKGROUND 

Clustering  is  an  important  area  of  application  for  a  variety 
of  fields  including  data  mining  [FPSU96],  statistical  data 
analysis  [KR89,BR93],  compression  [ZRL97],  and  vector 
quantization.  Clustering  has  been  formulated  in  various 
ways  in  the  machine  learning  [F87],  pattern  recognition 
[DH73,F90],  optimization  [BMS97,SI84],  and  statistics 
literature  [KR89,BR93,B95,S92,S86].  The  fundamental 
clustering  problem  is  that  of  grouping  together 
(clustering)  data  items  which  are  similar  to  each  other. 
The  most  general  approach  to  clustering  is  to  view  it  as  a 
density  estimation  problem  [S86,  S92,BR93].  We  assume 
that  in  addition  to  the  observed  variables  for  each  data 
item,  there  is  a  hidden,  unobserved  variable  indicating  the 
“cluster  membership”  of  the  given  data  item.  Hence  the 
data  is  assumed  to  arrive  from  a  mixture  model  and  the 
mixing  labels  (cluster  identifiers)  are  hidden.  In  general,  a 
mixture model M having K clusters $C_i$, i=1,...,K, assigns
a probability to a data point x as follows:

$$\Pr(x \mid M) = \sum_{i=1}^{K} W_i \cdot \Pr(x \mid C_i, M)$$

where $W_i$ are called the
mixture weights. Many methods assume that the number
of clusters K is known or given as input.
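
As a concrete instance of the mixture probability above, the sketch below evaluates the density of a one-dimensional Gaussian mixture. The choice of Gaussian components, the function names, and the parameter values are assumptions for illustration; the mixture model itself allows any component distribution.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian component, playing the role of Pr(x | C_i, M)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_density(x, weights, means, stds):
    """Pr(x | M) = sum_i W_i * Pr(x | C_i, M) for a 1-D Gaussian mixture."""
    return sum(w * gaussian_pdf(x, mu, s)
               for w, mu, s in zip(weights, means, stds))


# Hypothetical two-cluster model with equal mixture weights.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
p_mid = mixture_density(2.0, weights, means, stds)    # between the two modes
p_mode = mixture_density(0.0, weights, means, stds)   # at the first mode
```

The density is much higher at a cluster mean than halfway between the clusters, which is the structure that mode-seeking initialization tries to exploit.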

The  clustering  optimization  problem  is  that  of  finding 
parameters associated with the mixture model M ($W_i$ and
parameters of components $C_i$) which maximize the
likelihood  of  the  data  given  the  model.  The  probability 
distribution  specified  by  each  cluster  can  take  any  form. 
The  EM  algorithm  [DLR77,  CS96]  is  a  well-known 
technique  for  estimating  the  parameters  in  the  general 
case.  K-Means  clustering  is  a  popular  method  (historically 
also  known  as  Forgy’s  method  [F65]  or  MacQueen’s 
algorithm  [M67]).  It  is  really  a  special  case  of  EM  that 
assumes: 

1)  Each  cluster  is  modeled  by  a  spherical  Gaussian 
distribution; 

2)  Each  data  item  is  assigned  to  a  single  cluster; 

3)  The  mixture  weights  ($W_i$)  are  equal.

Note  that  K-Means  [DH73,F90]  is  defined  over  numeric 
(continuous-valued)  data  since  it  requires  the  ability  to 
compute  the  mean.  A  discrete  version  of  K-Means  exists 
and is sometimes referred to as harsh EM [NH98]. The K-Means
algorithm finds locally optimal solutions
minimizing the sum of the L2 distance squared between
each data point and its nearest cluster center (“distortion”)
[BMS97,SI84], which is equivalent to maximizing the
likelihood given the assumptions listed above.
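
The distortion that K-Means minimizes can be written directly from this definition. The function name and the toy points below are assumptions for illustration.

```python
def distortion(points, centers):
    """Sum over all points of the squared L2 distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((pi - ci) ** 2 for pi, ci in zip(p, c))
                     for c in centers)
    return total


points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]]          # hypothetical data
good = distortion(points, [[0.0, 1.0], [10.0, 0.0]])    # centers near the data
poor = distortion(points, [[5.0, 0.0], [5.0, 2.0]])     # centers between groups
```

Lower distortion corresponds to a better solution in the K-Means sense, so it can be used to compare solutions obtained from different starting points.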

There  are  various  approaches  to  solving  the  problem  of 
determining  (locally)  optimal  values  of  the  parameters 
given  the  data.  Iterative  refinement  approaches,  which 
include  EM  and  K-Means,  are  the  most  effective.  The 
basic  algorithm  works  as  follows: 

1)  Initialize  the  model  parameters  to  a  current  model; 

2)  Decide  memberships  of  the  data  items  to  clusters, 
assuming  that  the  current  model  is  correct; 

3) Re-estimate the parameters of the current model
assuming that the data memberships obtained in 2)
are correct, producing a new model;

4) If the current model and the new model are sufficiently close
to each other, terminate, else go to 2).

Figure 1. Two Gaussian bumps in 2-d: full sample versus small subsample.

We focus on the initialization step 1. Given the initial
condition of step 1, the algorithms define a deterministic
mapping from initial point to solution. Both the K-Means
and EM algorithms converge finitely to a point (set of
parameter values) that is locally maximal for the
likelihood of the data given the model. The deterministic
mapping means the locally optimal solution is sensitive to
the initial point choice.

There is little prior work on initialization methods for
clustering. According to [DH73] (p. 228):

"One question that plagues all hill-climbing
procedures is the choice of the starting point.
Unfortunately, there is no simple, universally
good solution to this problem."

"Repetition with different random selections" [DH73]
appears to be the de facto method. Most presentations do
not address the issue of initialization or assume either
user-provided or randomly chosen starting points [DH73,
R92, KR89]. A recursive method for initializing the
means by running K clustering problems is mentioned in
[DH73]. A variant of this method consists of taking the
mean of the entire data and then randomly perturbing it K
times [TMCH97]. This method does not appear to be
better than random initialization in the case of EM over
discrete data [MH98]. In [BMS97], the values of initial
means along any one of the d coordinate axes are
determined by selecting the K densest "bins" along that
coordinate.

Methods to initialize EM include K-Means solutions,
hierarchical agglomerative clustering (HAC) [DH73,R92,
MH98] and “marginal+noise” [TMCH97]. It was found
that EM over discrete data initialized with either HAC
or “marginal+noise” showed no improvement over
random initialization [MH98].

For the remainder of this paper we focus on the K-Means
algorithm although the method can refine an initial point
for other clustering algorithms. Our focus on K-Means is
justified by the following: 1) it is a standard technique for
clustering, used in a wide array of applications and even
as a way to initialize the more expensive EM clustering
algorithm [B95, CS96, MH98]; 2) regardless of which
clustering algorithm is being used, K-Means is employed
internally by our initialization refinement method; 3) the
purpose of the paper is to illustrate the refinement
procedure, not to evaluate a variety of clustering
algorithms.
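
The four-step iterative refinement loop of Section 1, specialized to K-Means, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, "sufficiently close" is taken to mean "unchanged", and a center that attracts no data is simply left in place.

```python
def sq_dist(p, c):
    """Squared L2 distance between a point and a center."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kmeans(points, centers, max_iter=100):
    """Iterative refinement specialized to K-Means, from given initial centers."""
    centers = [list(c) for c in centers]               # step 1: current model
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # Step 3: re-estimate each center as the mean of its members;
        # a center that attracted no data is left where it is.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        # Step 4: stop when the model no longer changes.
        if new_centers == centers:
            break
        centers = new_centers
    return centers


data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]   # hypothetical
result = kmeans(data, [[1.0, 1.0], [9.0, 1.0]])     # reasonable starting point
stuck = kmeans(data, [[5.0, 1.0], [50.0, 50.0]])    # second center attracts no data
```

The second call shows the sensitivity to initialization discussed above: the mapping from starting point to solution is deterministic, and a starting center far from all data stays empty while the other center settles at the overall mean.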

2.  REFINING  INITIAL  CONDITIONS 

We address the problem of initializing a general
clustering algorithm, but limit our presentation of results
to K-Means. Since no good method for initialization
exists [MH98], we compare against the de facto standard
method for initialization: randomly choosing an initial
starting point. However, the method can be applied to any
starting point provided.

A solution of the clustering problem is a parameterization
of each cluster model. This parameterization can be
performed by determining the modes (maxima) of the
joint probability density of the data and placing a cluster
centroid at each mode. Hence one clustering approach is
to estimate the density and attempt to find the maxima
(“bumps”) of the estimated density function. Density
estimation in high dimensions is difficult [S92], as is
bump hunting [F90]. We propose a method, inspired by
this procedure, that refines the initial point to a point likely
to be closer to the modes. The challenge is to perform
refinement efficiently.

The basic heuristic is that severely subsampling the data
will naturally bias the sample to representatives “near” the
modes. In general, one cannot guard against the
possibility of points from the tails appearing in the
subsample. We have to overcome the problem that the
estimate is fairly unstable due to elements of the tails
appearing in the sample. Figure 1 shows data drawn from
a mixture of two Gaussians (clusters) in 2-D with means
at [3,2] and [5,5]. On the left is the full data set, on the
right a small subsample is shown, providing information
on the modes of the joint probability density function.
Each of the points on the right may be thought of as a
“guess” at the possible location of a mode in the
underlying distribution. The estimates are fairly varied,
but they certainly exhibit “expected” behavior. Worthy of
note here is that good separation between the two clusters
is achieved. This observation indicates that the solutions
obtained by clustering over a small subsample may
provide good refined initial estimates of the true means,
or centroids, in the data. However, this method often
produces noisy estimates due to single small subsamples,
especially in skewed distributions and high dimensions
(Figure 2). This behavior is fairly common when
clustering over small subsamples. In fact it is surprisingly
frequent even in low dimensions using data from well-separated
Gaussians¹. Figure 2 can also be used to
illustrate the importance of the problem of having good
initial points. An initial cluster center attracting no data
may remain empty (Figure 2, left), while a starting point
with no empty clusters usually produces better solutions
(right).

Figure 2: Result of clustering two different samples drawn from the same distribution, and
initialized with the same starting point (produced solution indicated by ‘+’).


Figure 3. Multiple Solutions from Multiple Samples.

2.1 Clustering Clusters

In order to overcome the problem of noisy estimates, we
employ the following procedure. Multiple subsamples,
say J, are drawn and clustered independently producing J
estimates of the true cluster locations. To avoid the noise
associated with each of the J solutions, we employ a
“smoothing” procedure. However, to “best” perform this
smoothing, one needs to solve the problem of grouping
the K*J points (J solutions, each having K clusters) into K
groups in an “optimal” fashion. Figure 3 shows 4
solutions obtained for K=3, J=4. The “true” cluster means
are depicted by “X”. The A's show the 3 points obtained
from the first subsample, B's second, C's third, and D's
fourth. The problem then is determining that D1 is to be
grouped with A1 but A2 should not be grouped with {A1,
B1, C1, D1}.

¹ In fact data from well-separated Gaussians in low-D are a “best-case”
scenario for the behavior of a random sampling based approach. Note
the idealized conditions: no noise, algorithm given the correct number
of clusters K. With real-world data ideal conditions are difficult to
achieve, hence the behavior is expected to be worse (and indeed it is).

2.2  The  Refinement  Algorithm 

The refinement algorithm initially chooses J small
random sub-samples of the data, S_i, i=1,...,J. The
sub-samples are clustered via K-Means with the proviso that
empty clusters at termination will have their initial centers
re-assigned and the sub-sample will be re-clustered. The
sets CM_i, i=1,...,J, are these clustering solutions over the
sub-samples, which together form the set CM. CM is then
clustered via K-Means initialized with CM_i, producing a
solution FM_i. The refined initial point is then chosen as
the FM_i having minimal distortion over the set CM.

Clustering CM is a smoothing over the CM_i to avoid
solutions “corrupted” by outliers included in the
sub-sample S_i. The refinement algorithm takes as input: SP
(initial starting point), Data, K, and J (number of small
subsamples to be taken from Data).

Algorithm Refine(SP, Data, K, J)

0.  CM = ∅

1.  For i = 1,...,J

    a.  Let S_i be a small random subsample of Data

    b.  Let CM_i = KMeansMod(SP, S_i, K)

    c.  CM = CM ∪ CM_i

2.  FMS = ∅

3.  For i = 1,...,J

    a.  Let FM_i = KMeans(CM_i, CM, K)

    b.  Let FMS = FMS ∪ FM_i

4.  Let FM = ArgMin_{FM_i} {Distortion(FM_i, CM)}

5.  Return (FM)
We define the following functions called by the refinement
algorithm: KMeans( ), KMeansMod( ) and Distortion( ).
KMeans is simply a call to the classic K-Means algorithm
taking: an initial starting point, dataset and the number of
clusters K, returning a set of K d-dimensional vectors, the
estimates of the centroids of the K clusters. KMeansMod
takes the same arguments as KMeans (above) and performs
the same iterative procedure as classic K-Means except for
the following slight modification. When classic K-Means
has converged, the K clusters are checked for membership.
If any of the K clusters have no membership (which often
happens when clustering over small subsamples), the
corresponding initial estimates of the empty cluster
centroids are set to data elements which are farthest from
their assigned cluster center, and classic K-Means is called
again from these new initial centroids.

Figure 4. The Starting Point Refinement Procedure (stages
shown: cluster multiple subsamples; multiple sample
solutions; cluster solutions from multiple starts).

The  heuristic  re-assignment  is  motivated  by  the 
following:  if,  at  termination  of  K-Means,  there  are  empty 
clusters  then  reassigning  all  empty  clusters  to  points 
farthest  from  their  respective  centers  decreases  distortion 
most  at  this  step.  An  example  of  clusters  having  zero 
membership  is  depicted  in  Figure  3  (left). 

Distortion takes a set of K estimates of the means and the
data set and computes the sum of squared distances of
each  data  point  to  its  nearest  mean.  This  scalar  measures 
the  degree  of  fit  of  a  set  of  clusters  to  the  dataset.  The  K- 
Means  algorithm  terminates  at  a  solution  which  is  locally 
optimal  for  this  distortion  function  [SI84].  The  refinement 
process  is  illustrated  in  the  diagram  of  Figure  4. 
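
The steps of Refine, together with KMeans, KMeansMod and Distortion as described above, can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `sample_size` and `max_restarts` parameters, and the convention that K is taken from the length of the starting point are all choices made here for brevity.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def distortion(centers, points):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def kmeans(start, points, max_iter=100):
    """Classic K-Means; an empty cluster simply keeps its old center."""
    centers = [list(c) for c in start]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else old)
        if new == centers:          # assignments stopped changing
            break
        centers = new
    return centers

def kmeans_mod(start, points, max_restarts=10):
    """K-Means that re-seeds empty clusters at the points farthest
    from their assigned center, then re-clusters (assumed restart cap)."""
    centers = kmeans(start, points)
    for _ in range(max_restarts):
        labels = assign(points, centers)
        empty = [k for k in range(len(centers)) if k not in labels]
        if not empty:
            break
        far = sorted(range(len(points)),
                     key=lambda i: sq_dist(points[i], centers[labels[i]]),
                     reverse=True)
        for k, i in zip(empty, far):
            centers[k] = list(points[i])
        centers = kmeans(centers, points)
    return centers

def refine(sp, data, j, sample_size, rng):
    # Step 1: cluster J random subsamples, each started from SP.
    cm_sets = [kmeans_mod(sp, rng.sample(data, sample_size)) for _ in range(j)]
    cm = [c for cm_i in cm_sets for c in cm_i]        # CM = union of the CM_i
    # Step 3: cluster CM once per subsample solution.
    fms = [kmeans(cm_i, cm) for cm_i in cm_sets]
    # Step 4: keep the candidate with minimal distortion over CM.
    return min(fms, key=lambda fm: distortion(fm, cm))
```

A call such as `refine(sp, data, j=10, sample_size=len(data) // 100, rng=random.Random(0))` returns K refined centers that can then be passed to `kmeans` on the full data set.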

2.3  Computational  Complexity  and  Scalability  to 
Large  Databases 


The refinement algorithm is primarily intended to work
on large databases. When working over small datasets
(e.g. most data sets in the Irvine Repository), applying the
classic K-Means algorithm from many different starting
points is a feasible option. However, as database size
increases (especially in dimensionality), efficient and
accurate initialization becomes critical. A clustering
session on a data set with many dimensions and tens of
thousands or millions of records can take hours to days.
In [BFR98], we present a method for scaling clustering to
very large databases, specifically targeted at databases not
fitting in RAM. We show that accurate clustering can be
achieved with improved results over classic K-Means
applied to an appropriately sized random subsample of the
database [BFR98]. Scalable clustering methods
obviously benefit from better initialization.

Since our method works on very small samples of the
data, the initialization is fast. For example, if we use
sample sizes of 1% (or less) of the full dataset size, trials
over 10 samples can be run in time complexity that is less
than 10% of the time needed for clustering the full
database. For very large databases, the initial sample
becomes negligible in size.

If, for a data set D, a clustering algorithm requires Iter(D)
iterations to cluster it, then its time complexity is
proportional to |D| * Iter(D). A small subsample S ⊂ D,
where |S| << |D|, typically requires significantly fewer
iterations to cluster. Empirically, it is safe to expect that
Iter(S) < Iter(D). Hence, given a specified budget of time
that a user allocates to the refinement process, we simply
determine the number J of subsamples to use in the
refinement process. When |D| is very large, and |S| is a
small proportion of |D|, refinement time is essentially
negligible, even for large J.


Another  desirable  property  of  the  refinement  algorithm  is 
that  it  easily  scales  to  very  large  databases.  The  only 
memory  requirement  is  to  hold  a  small  subsample  in 
RAM.  In  the  secondary  clustering  stage,  only  the 
solutions  obtained  from  the  J  subsamples  need  to  be  held 
in  RAM. 
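
The cost argument of this section (clustering a set of n records costs roughly n times the number of iterations) can be turned into a small calculation for choosing J from a time budget. The sketch below is only illustrative: the function names are hypothetical, the budget is expressed in "record visits", and the cost of the secondary clustering of the K*J solution points is ignored.

```python
def refinement_cost_fraction(n_full, iters_full, n_sub, iters_sub, j):
    """Estimated cost of clustering J subsamples, as a fraction of the
    cost |D| * Iter(D) of one clustering run over the full data set."""
    return (j * n_sub * iters_sub) / (n_full * iters_full)

def max_subsamples(budget_visits, n_sub, iters_sub):
    """Largest J whose estimated cost |S| * Iter(S) * J fits the budget
    (budget given in record visits, an assumed unit)."""
    return budget_visits // (n_sub * iters_sub)
```

For example, with one million records, 1% subsamples, J = 10 and the same iteration count on both, `refinement_cost_fraction(1_000_000, 20, 10_000, 20, 10)` gives 0.1; since Iter(S) is typically smaller than Iter(D), the fraction in practice falls below 10%, in line with the estimate above.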

Figure 5: Left: K-Means solution (large red circles) from random initial point (blue squares). Right:
Refined initial point (red circles), random initial point (blue squares).

Note we assume that it is possible to obtain a random
sample from a large-scale database. While this sounds
simple, in reality this can be a challenging task. Unless
one can guarantee that the records in a database are not
ordered by some property, random sampling can be as
expensive as scanning the entire database (using some
scheme such as reservoir sampling, e.g. [J62]). Note that
in a database environment, what one thinks of as a data
table (a view) may not exist as a physical table. The result
of a query may involve joins, groupings, and sorts. In
many cases database operations impose a special ordering
on the result set, and “randomness” of the resulting
database view cannot be assumed in general.
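
To make the cost of sampling from an ordered result set concrete, here is a sketch of one-pass reservoir sampling. It uses the textbook form often called Algorithm R, which is an assumption here and not necessarily the exact scheme of [J62]; the function name and the `rng` parameter are illustrative.

```python
import random

def reservoir_sample(records, n, rng):
    """Uniform random sample of n records from an iterable of unknown
    length, read once from start to end (one full scan)."""
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)          # fill the reservoir first
        else:
            j = rng.randint(0, i)          # inclusive on both ends
            if j < n:
                reservoir[j] = rec         # replace with probability n/(i+1)
    return reservoir
```

The sample is uniform regardless of how the records are ordered, but every record is visited once, which is exactly the full-scan cost noted above.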

2.4  An  Example 

Figure  5  illustrates  the  sensitivity  of  K-Means  solutions  to 
initial  conditions.  Elements  are  sampled  from  three 
Gaussians  in  2  dimensions.  Note  that  the  Gaussians  in 
this  case  happen  to  be  centered  along  a  diagonal.  The 
reason  for  this  choice  is  that  even  as  the  dimensionality  of 
the  data  goes  higher,  any  2  dimensional  projection  of  the 
higher  dimensional  data  will  have  this  same  form,  making 
the  data  set  easy  for  a  visualization-based  approach. 
Simply  project  the  data  to  2  dimensions,  and  the  clusters 
reveal themselves. This is a rare property since, if the
Gaussians are not aligned along the diagonal, any
lower-dimensional projection may result in overlaps and
separability in 2 dimensions is lost. The left figure shows
a  random  starting  point  and  the  corresponding  K-Means 
solution.  The  right  figure  shows  the  same  initial  random 
points  and  the  result  of  the  refinement  procedure  on  this 
random  initial  point.  Note  that  in  this  case  the  refined 
point  is  very  close  to  the  true  solution.  Running  K-Means 
from  the  refined  point  converges  to  the  true  solution. 

It is important to point out that this example is for
illustrative purposes only. The interesting cases are
high-dimensional data sets with more data items.
Computational results indicate that the refinement method
scales well to higher dimensions (100-D and more).

3.  RESULTS  ON  SYNTHETIC  DATA 

3.1  Data  Set  Description 


Synthetic data was created for dimension d = 2, 3, 4, 5,
10, 20, 40, 50 and 100. For a given value of d, data was
sampled from 10 Gaussians (hence K=10) with elements
of their mean vectors (the true means) μ sampled from a
uniform distribution on [-5,5]. Elements of the diagonal
covariance matrices Σ were sampled from a uniform
distribution on [0.7, 1.5]. The number of data points
sampled was chosen as 20 times the number of
parameters estimated by K-Means. The K=10 Gaussians
were not evenly weighted.
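
A generator for data of this kind might look as follows. Several details are assumptions made for the sketch rather than facts from the text: the number of parameters estimated by K-Means is taken to be K*d (so 20*K*d points are drawn), the uneven mixture weights are drawn at random because the actual weights are not given, and the function name is hypothetical.

```python
import random

def make_synthetic(d, k, rng):
    """Sample 20*k*d points from k diagonal-covariance Gaussians in d dims."""
    means = [[rng.uniform(-5, 5) for _ in range(d)] for _ in range(k)]
    variances = [[rng.uniform(0.7, 1.5) for _ in range(d)] for _ in range(k)]
    raw = [rng.random() + 0.1 for _ in range(k)]    # assumed uneven weights
    weights = [w / sum(raw) for w in raw]
    n = 20 * k * d                                   # assumed: k*d parameters
    data = []
    for _ in range(n):
        c = rng.choices(range(k), weights=weights)[0]
        # gauss() takes a standard deviation, hence the square root
        data.append([rng.gauss(m, v ** 0.5)
                     for m, v in zip(means[c], variances[c])])
    return data, means
```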

3.2  Experimental  Methodology 

The  goal  of  this  experiment  is  to  evaluate  how  close  the 
means  estimated  by  classic  K-Means  are  to  the  true 
Gaussian  means  generating  the  synthetic  data.  We 
compare  3  initializations: 

1.  No  Refinement:  random  starting  point  chosen 
uniformly  on  the  range  of  the  data. 

2.  Refinement (J=10): a starting point refined from (1)
using our method. The size of each random subsample
is 10% of the full dataset size and the number of
subsamples taken is 10.

3.  Refinement (J=1): same as 2 but over a single
random subsample of size 10%.

Figure 6. Comparing performance as dimensionality increases

Once classic K-Means has computed a solution over the
full dataset from any of the 3 initial points described
above, the estimated means must be matched with the true
Gaussian means in some optimal way prior to computing
the distance between these estimated means and the true
Gaussian means. Let $\mu^{l}$, $l = 1, \ldots, K$ be the K true
Gaussian means and let $\bar{x}^{l}$, $l = 1, \ldots, K$ be the K means
estimated by classic K-Means over the full dataset. A
“permutation” $\pi$ is determined so that the following
quantity is minimized:

$$\sum_{l=1}^{K} \left\lVert \mu^{l} - \bar{x}^{\pi(l)} \right\rVert$$

The “score” for a solution computed by classic K-Means
over the full dataset is simply the above quantity divided
by K. This is the average distance between the true
Gaussian means and those estimated by K-Means over the
full dataset from a given initial starting point.
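
The matching step can be illustrated by brute force: try every permutation of the estimated means, keep the one with the smallest total distance to the true means, and divide by K. The sketch assumes Euclidean distance and uses a hypothetical function name; exhaustive search is shown only for clarity, and for K = 10 (about 3.6 million permutations) an assignment solver would be the more practical choice.

```python
import math
from itertools import permutations

def matching_score(true_means, est_means):
    """Average distance between true and estimated means under the
    best one-to-one matching (exhaustive search over permutations)."""
    k = len(true_means)
    best = min(
        sum(math.dist(true_means[l], est_means[perm[l]]) for l in range(k))
        for perm in permutations(range(k))
    )
    return best / k
```

For instance, `matching_score([[0, 0], [10, 0]], [[10, 1], [0, 1]])` evaluates to 1.0, because the best matching pairs each true mean with the estimate one unit away.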


3.3  Experimental  Results 

Figure 6 summarizes results averaged over 10 random
initial points determined uniformly on the range of the
data. Note that the K-Means solution computed from
“Refined (J=10)” is consistently nearer to the true
Gaussian means generating the dataset than the K-Means
solution computed from either the random initial point or
the “Refined (J=1)” initial point. On the left we
summarize ratios of average distance to the true Gaussian
means relative to the average distance for the classic
K-Means solution computed from the refined initial point.
Worthy of note in these results are the following facts:

1.  For dimensions 2-50, the refinement method (Refined
(J=10)) always did better than the random starting
point (Unrefined) and the point refined over 1
subsample (Refined (J=1)).

2.  For  dimension  100,  in  9  of  the  10  independent  trials 
our  refinement  method  did  better  than  the  random 
starting  point. 

3.  Refined solutions are between 2.34 (d=3) and 6.44
(d=50) times closer to the true Gaussian means than
solutions from the random initial point and between
1.09 (d=3) and 4.80 (d=50) times closer than solutions
computed from the “Refined (J=1)” initial point.

In one run at 100 dimensions, we did slightly worse. This
explains the large variance number for 100 dimensions. If
we exclude that one data point, the variance drops to
within range of all other dimensions. The fact that the
minimum ratio occurs for datasets with small
dimensionality and the maximum ratio occurs for datasets
with large dimensionality indicates the utility of the
refinement algorithm for large-dimensional datasets.


4.  RESULTS  ON  REAL-WORLD  DATA 

We present computational results on 2 classes of publicly
available “real-world” datasets. We are primarily
interested in large databases: hundreds of dimensions
and tens of thousands to millions of records. It is for
these data sets that our method exhibits the greatest value.
The reason is simple: a clustering session on a large
database is a time-consuming affair. Hence a refined
starting condition can ensure that the time investment pays
off.

To illustrate this, we used a large publicly available data
set from Reuters News Service. The data is
described in Section 4.2. We also wanted to demonstrate
the refinement procedure using data sets from the UCI
Machine Learning Repository. For the most part, we
found that these data sets are too easy: they are low
dimensional and have a very small number of records.
With  a  small  number  of  records,  it  is  feasible  to  perform 
multiple  restarts  efficiently.  Since  the  sample  size  is  small 
to  begin  with,  sub-sampling  for  initialization  is  not 
effective.  Hence  most  of  these  data  sets  are  not  of  interest 
to  us.  Nevertheless,  we  report  on  our  general  experience 
with  them  as  well  as  detailed  experience  with  one  of  these 
data  sets  to  illustrate  that  the  method  we  advocate  is 
useful  when  applied  to  smaller  data  sets.  We  emphasize, 
however,  that  our  refinement  procedure  is  best  suited  for 
large-scale  data.  The  refinement  algorithm  operates  over 
small  sub-samples  of  the  database  and  hence  run-times 
needed  to  determine  a  “good”  initial  starting  point  (which 
speeds  the  convergence  on  the  full  data  set)  are  orders  of 
magnitude  less  than  the  total  time  needed  for  clustering  in 
a  large-scale  situation. 

We note that it is very likely that the cluster labeling
associated with many real-world databases does not
correspond to the distortion measure minimized by
K-Means.

4.1  Datasets  from  UCI  ML  Repository 
We evaluated our method on several Irvine data sets. First
we  present  results  on  the  Image  Segmentation  data  set, 
then  we  discuss  the  results  over  the  other  data  sets. 


Image  Segmentation  Data  Set 

This  data  set  consists  of  2310  data  elements  in  19 
dimensions.  Instances  are  drawn  randomly  from  a 
database  of  7  outdoor  images  (brickface,  sky,  foliage, 
cement,  window,  path,  grass).  Each  of  the  7  images  is 
represented by 330 instances.²

Experimental  Methodology 


Random  initial  starting  points  were  computed  by 
sampling  uniformly  over  the  range  of  the  data.  We 
compare  solutions  achieved  by  the  classic  K-Means 
algorithm  starting  from:  1)  random  initial  starting  points, 
and  2)  initial  points  refined  by  our  method. 


Once  classic  K-Means  has  converged,  the  “quality”  of  the 
solution  must  be  determined.  Unlike  the  case  of  synthetic 
data,  we  cannot  measure  distance  to  the  true  solution 
since  “truth”  is  not  known.  However,  we  can  use  average 
class  purity  within  each  cluster  as  one  measure  of  quality. 
The  other  measure,  which  is  not  dependent  on  a 
classification,  is  the  distortion  of  the  data  given  the 
clusters.  Quality  scoring  methods  are: 

Information Gain: estimates the “amount of information”
gained by clustering the database as measured by the
reduction in class impurity within clusters. For a database
with L known classes, let $c_l$ be the number of data
elements in class $l$, where $l = 1, \ldots, L$. Let $m$ be the total
number of data points in the database. The Total Entropy
of the database is:

$$\text{TotalEntropy} = -\sum_{l=1}^{L} \frac{c_l}{m} \log\left(\frac{c_l}{m}\right)$$

Upon convergence of the classic K-Means algorithm from
a given initial starting point, the Weighted Entropy is
computed over the given clustering as follows: Form the
$K \times L$ cluster/class matrix $C$ with the $(i, j)$-th element
being the number of elements of class $j$ belonging to
cluster $i$. Notice that the clustering will completely
recover the assigned classes if the cluster/class matrix has
a permuted identity nonzero structure. Let $CS_k$ be the size
of the k-th cluster; then the class entropy for the k-th
cluster is given by:

$$\text{ClusterEntropy}(k) = -\sum_{l=1}^{L} \frac{C_{k,l}}{CS_k} \log\left(\frac{C_{k,l}}{CS_k}\right)$$

The weighted entropy of the entire clustering is given by:

$$\text{WeightedEntropy}(K) = \sum_{k=1}^{K} \frac{CS_k}{m} \, \text{ClusterEntropy}(k)$$

Information Gain = Total Entropy - Weighted Entropy(K).


Distortion: Given the K means estimated by the classic
K-Means algorithm, the distortion value that we consider
is simply the sum of the squared L2 distances between the
data items and the mean of their assigned cluster. A
smaller value for the distortion measure indicates that the
model parameters (i.e. means) are a better fit to the
database, given that the K-Means assumptions hold.
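
The information gain score can be computed directly from the cluster/class count matrix. The following sketch is one way to do it; the function names are hypothetical, and the logarithm base (2 by default) is an assumption, since the base used for the reported values is not stated.

```python
import math

def entropy(counts, base=2.0):
    """Entropy of a list of class counts; zero counts contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base)
                for c in counts if c > 0)

def information_gain(cluster_class, base=2.0):
    """cluster_class[k][l] = number of elements of class l in cluster k."""
    m = sum(sum(row) for row in cluster_class)
    class_totals = [sum(col) for col in zip(*cluster_class)]
    total_entropy = entropy(class_totals, base)
    # Each cluster's entropy is weighted by its share of the data;
    # empty clusters are skipped.
    weighted = sum((sum(row) / m) * entropy(row, base)
                   for row in cluster_class if sum(row) > 0)
    return total_entropy - weighted
```

A clustering that recovers the classes exactly, such as `[[5, 0], [0, 5]]`, has a gain equal to the total entropy (1 bit here), while `[[5, 5], [5, 5]]` has a gain of 0.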

Results:  Image  Segmentation  Database 

Average information gain over 10 random initial points
for classic K-Means without refining the initial point was
0.3125 ± 0.3188 (± one standard deviation). Average
information gain for K-Means initialized from a refined
(J = 10) starting point was 0.8195 ± 0.1458. The amount of
information gained on average by the solutions computed
from the refined point was 2.6222 times that of the
solution computed over the random initial point.

Furthermore,  on  average,  solutions  computed  from  the 
refined  initial  points  (J=10)  reduced  distortion  by  44.41% 
over  solutions  computed  from  random  initial  points. 

4.2  Other  Real  World  Datasets 
We  evaluated  the  refinement  procedure  on  other  data  sets 
such  as  Fisher’s  IRIS,  Star-Galaxy-Bright,  etc.  Because 
these  data  sets  are  very  low  dimensional  and  their  sizes 
small,  the  majority  of  the  results  were  of  no  interest. 

Clustering  these  data  sets  from  random  initial  points  and 
from  refined  initial  points  led  to  approximately  equal  gain 
in  entropy  and  equal  distortion  measures  in  most  cases. 
We  did  observe,  however,  that  when  a  random  starting 
point  leads  to  a  “bad”  solution,  then  refinement  indeed 
takes  it  to  a  “good”  solution.  So  in  those  (admittedly 
rare)  cases,  refinement  does  provide  expected 
improvement.  We  use  the  Reuters  information  retrieval 
data  set  to  demonstrate  our  method  on  a  real  and  difficult 
clustering  task. 

Reuters  Information  Retrieval  Data  Set 

The Reuters text classification database is derived from
the original Reuters-21578 data set made publicly
available as part of the Reuters Corpus, through Reuters,
Inc., Carnegie Group and David Lewis³. This data consists
of 12,902 documents. Each document is a news article about
some topic, e.g. earnings, commodities, acquisitions,
grain, copper, etc. There are 119 categories, which
belong to some 25 higher level categories (there is a
hierarchy on categories). The Reuters database consists of
word counts for each of the 12,902 documents. There are
hundreds of thousands of words, but for purposes of our
experiments we selected the 302 most frequently
occurring words, hence each instance has 302 dimensions
indicating the integer number of times the corresponding
word occurs in the given document. Each document in
the IR-Reuters database has been classified into one or
more categories. We use K=25 for clustering purposes to
reflect the 25 top-level categories. The task is then to find
the best clustering given K=25.

Figure 7: Results on Reuters Data from 5 Starting
Points: percentage total distortion of refined solution
relative to unrefined solution.

² For a more detailed description of the data, see the Irvine ML Data
Repository at http://www.ics.uci.edu/~mlearn/MLRepository.html

³ See: http://www.research.att.com/~lewis/reuters21578/README.txt
for more details on this data set.

Reuters  Results 

For  this  data  set,  because  clustering  the  entire  database 
requires  a  large  amount  of  time,  we  chose  to  only  evaluate 
results  over  5  randomly  chosen  starting  conditions. 
Results  are  shown  in  the  chart  of  Figure  7.  The  chart 
shows  a  significant  decrease  in  the  total  distortion 
measure.  On  average  the  distortion  of  a  solution  obtained 
by  starting  from  a  refined  initial  point  was  about  80%  of 
the  corresponding  distortion  obtained  by  clustering  from 
the  corresponding  randomly  chosen  initial  starting  point. 

Since each document belongs to a category (there are 119
categories), we can also measure the quality
achieved by any clustering by measuring the gain in
information about the categories that each cluster gives
(i.e. pure clusters are informative). This is done in the
same manner we measure entropy for the image
segmentation dataset of Section 4.1. The quality of the
clusters can be measured by the average category purity
in each cluster. In this case the average information gain
for the clusters obtained from the refined starting point
was 4.13 times higher than the information gain obtained
without refining the initial points. The information gain
for the refined clustering was 0.071 with a standard
deviation of 0.001, while the unrefined initial points
resulted in an average information gain of 0.017 with a
standard deviation equal to 0.011.

5.  CONCLUDING  REMARKS 

A  fast  and  efficient  algorithm  for  refining  an  initial 
starting  point  for  a  general  class  of  clustering  algorithms 
has  been  presented.  The  refinement  algorithm  operates 


over  small  subsamples  of  a  given  database,  hence 
requiring  a  small  proportion  of  the  total  memory  needed 
to  store  the  full  database  and  making  this  approach  very 
appealing  for  large-scale  clustering  problems.  The 
procedure  is  motivated  by  the  observation  that 
subsampling  can  provide  guidance  regarding  the  location 
of  the  modes  of  the  joint  probability  density  function 
assumed  to  have  generated  the  data.  By  initializing  a 
general  clustering  algorithm  near  the  modes,  not  only  are 
the  true  clusters  found  more  often,  but  it  follows  that  the 
clustering  algorithm  will  iterate  fewer  times  prior  to 
convergence.  This  is  very  important  as  the  clustering 
methods  discussed  here  require  a  full  data-scan  at  each 
iteration  and  this  may  be  a  costly  procedure  in  a  large- 
scale  setting. 

Computational  results  on  synthetic  Gaussian  data  indicate 
that  solutions  computed  by  the  K-Means  algorithm  from 
the  refined  initial  points  are  superior  to  the  random  initial 
starting  points  and  to  a  point  refined  over  a  single  random 
subsample. Results on the small real-world Image
Segmentation data set indicate that the K-Means solutions
from the refined points provide twice as much
“information” as the solutions computed from the
random initial point. Furthermore, the average distortion
is decreased by 9%. Computational results on the Reuters
database of newswire stories in 302 dimensions indicate a
drop in distortion by about 20%. Information gain was
improved by a factor of 4.13 on this data set.

We believe that our method’s ability to obtain a
substantial refinement over randomly chosen starting
points is due in large part to our ability to avoid the empty
clusters problem that plagues traditional K-Means. Since
during  refinement  we  reset  empty  clusters  to  far  points 
and  reiterate  the  K-Means  algorithm,  a  starting  point 
obtained  from  our  refinement  method  is  less  likely  to  lead 
the  subsequent  clustering  algorithm  to  a  “bad”  solution. 
Our  intuition  is  confirmed  by  the  empirical  results. 

The refinement method presented so far has been in the
context of the K-Means algorithm. However, we note that
the same method is easily generalized to other
algorithms, and even to discrete data (on which means are
not  defined).  The  generalized  method  and  its  use  for 
initializing  the  EM  algorithm,  along  with  empirical 
results,  is  presented  in  [FRB98b].  The  key  insight  here  is 
that  if  some  algorithm  ClusterA  is  being  used  to  cluster 
the  data,  then  ClusterA  is  also  used  to  cluster  the 
subsamples.  The  algorithm  ClusterA  will  produce  a 
model.  The  model  is  essentially  described  by  its 
parameters.  The  parameters  are  in  a  continuous  space. 
The  stage  which  clusters  the  clusters  (i.e.  step  3  of  the 
algorithm  Refine  in  Section  2.2)  remains  as  is;  i.e.  we  use 
the K-Means algorithm in this step. The reason for using
K-Means is that the goal at this stage is to find the
“centroid” of the models, and in this case the harsh
membership assignment of K-Means is desirable.
Abstract

We show finite-time regret bounds for the multiarmed
bandit problem under the assumption that all rewards
come from a bounded and fixed range. Our regret bounds
after any number T of pulls are of the form
a + b log T + c log² T, where a, b, and c are positive
constants not depending on T. These bounds are shown to
hold for variants of the popular ε-greedy and Boltzmann
allocation rules, and for a new simple deterministic
allocation rule. Moreover, our results also apply to an
extension of the basic bandit problem in which reward
distributions can depend, to some extent, on previous
pulls and observed rewards. Finally, we discuss the
empirical performance of our algorithms with respect to
specific choices of the reward distributions.

1  INTRODUCTION 

One of the fundamental issues in reinforcement learning is
the exploration versus exploitation dilemma, whose
simplest instance is, perhaps, the bandit problem. In its most
basic formulation, a bandit problem is a set of N (with
N > 1) gambling machines. When a machine is played
(i.e., the “arm” of a bandit is pulled) it delivers a reward,
which we assume here to be a number from a fixed and
bounded real interval. A crucial feature is that each reward
is an independent random variable. Moreover, all rewards
delivered by the same machine are identically distributed
according to some unknown and fixed law (note, however,
that different machines may have different reward
distributions). The goal of the player in the optimality model
considered here is to minimize its regret, that is, the difference
between the expected total reward gained in a sequence of
T plays and the expected total reward one could gain by
playing T times any machine with maximum expected
reward. The exploration versus exploitation dilemma is now
clear: the player must trade off the need to sample
different machines, in order to compute reliable estimates of their
expected reward, with the need of exploiting the machine
with the highest current reward estimate, in order to keep
the regret as low as possible throughout the sequence of
plays.

A strategy for the player, also called “adaptive
allocation rule”, is a method for selecting the arm to pull at each
time t based on the rewards obtained during the previous
t - 1 pulls. The classical result of Lai and Robbins [14]
states that, asymptotically, the regret of any player
strategy must be Ω(log T), provided that the reward
distributions satisfy some mild assumptions. In the same paper,
Lai and Robbins also propose a general adaptive allocation
rule that, whenever the reward distributions belong to some
known parametric family, yields the optimal asymptotical
regret O(log T) (see [1, 13] and references therein for
extensions of these results). In this paper we show that simple
variants of the popular ε-greedy and Boltzmann heuristics
(see [11, 16] for a review of heuristics for the bandit
problem) achieve a regret of the form a + b log T + c log² T
for all T (where a, b, and c are positive constants) when
a lower bound on the difference between the highest and
second-highest expected reward is known in advance. We
also prove that the same regret bound holds for a new
deterministic allocation rule. Our results do not require any
further knowledge about the distributions of rewards and hold
for any set of distributions with bounded rewards. Finally,
our bounds apply, without modification, to a relaxed
variant of the bandit problem, where the reward distributions
can adversarially change after each play provided that each
reward expectation is kept fixed.

2  DEFINITIONS  AND  NOTATION 

Fix a positive integer $N > 1$. The $N$-armed bandit problem
(with bounded rewards) is a collection of $N$ random
processes $\{X_{j,t} : t = 1, 2, \ldots\}$, $j = 1, \ldots, N$, satisfying
$0 \le X_{j,t} \le 1$ (this choice for the range of rewards is not
crucial: by an appropriate scaling of the regret bounds, any
other bounded real interval would work). Each $X_{j,t}$ represents
the random reward a player can obtain by pulling arm
$j$ at time $t$. An adaptive allocation rule for the $N$-armed
bandit problem is an algorithm that, at each time $t$, chooses
the index $I_t \in \{1, \ldots, N\}$ of the next arm to pull based on
the sequence

\[ I_1, X_{I_1,1}, \ldots, I_{t-1}, X_{I_{t-1},t-1} \]

of past pulls and observed rewards. We will also investigate
randomized allocation rules, whose behavior depends on an
additional internal random source. The (expected) regret
after the first $T$ pulls of the allocation rule that pulls arms
$I_1, \ldots, I_T$ is


\[ \max_{1 \le j \le N} E\left[ \sum_{t=1}^{T} \left( X_{j,t} - X_{I_t,t} \right) \right] . \qquad (1) \]


Here  the  expectation  E  [•]  is  understood  with  respect  to  the 
stochastic  generation  of  rewards  and,  for  randomized  rules, 
also  with  respect  to  the  internal  randomization  of  the  rule. 

In the standard formulation of the bandit problem it
is assumed that the rewards delivered by the arms are
independent random variables $X_{j,t}$ with stationary means
$\mu_j$, for $t = 1, 2, \ldots$ and $j = 1, \ldots, N$. All of our
results will hold under this assumption. Moreover, all
of our results will also hold under the weaker assumption
that $E[X_{j,t} \mid \mathcal{F}_{t-1}] = \mu_j$ for each $t$ and $j$, where
$\mathcal{F}_{t-1}$ denotes the $\sigma$-field generated by the random variables
$I_1, X_{I_1,1}, \ldots, I_{t-1}, X_{I_{t-1},t-1}$. In other words, the
distribution of each new random reward can depend in an
adversarial way on the previous pulls and observed rewards, as
long as its mean is kept fixed.

Throughout the paper, without loss of generality, assume
that $\mu_1 > \mu_j$ for all $j = 2, \ldots, N$ and let
$\Delta(\mu_1, \ldots, \mu_N) = \min_{2 \le j \le N} (\mu_1 - \mu_j)$. Furthermore,
let $\Delta_j = \mu_1 - \mu_j$ for each $j = 1, \ldots, N$ and let
$D = \sum_{j=1}^{N} \Delta_j$. Our allocation rules have an input
parameter $d > 0$ and our regret bounds hold only if
$0 < d \le \Delta(\mu_1, \ldots, \mu_N)$, where $\mu_1, \ldots, \mu_N$ are the unknown
problem parameters. Furthermore, our bounds grow on the order
of $1/d^2$, so $d$ should be chosen as close as possible to
$\Delta(\mu_1, \ldots, \mu_N)$. However, if an arbitrary value of $d$ is fed
to the allocation rules described in Section 3, we can still
prove some weaker form of regret bound.

Finally, we use ln for the natural logarithm and log for
the base 2 logarithm.


3  REGRET  BOUNDS 

Many heuristics for the bandit problem assign to each arm
$i$ a probability of being pulled that is proportional to the
current reward estimate for arm $i$. A popular example is
the Boltzmann Exploration (BE) heuristic (see [3] and
references therein). This allocation rule, at each time $t$, draws
the arm to pull according to the exponential distribution
$e^{\hat\mu_{i,t-1}/\tau} / Z_t$ for $i = 1, \ldots, N$, where $\hat\mu_{i,t-1}$ is the
current estimate of the expected reward for arm $i$, the quantity
$\tau > 0$ is a “temperature” parameter, and $Z_t$ is a
normalization factor. Note that, for $\tau \to 0$, BE reduces to
the greedy rule always choosing to pull the arm with the
highest current reward estimate. On the other hand, for
$\tau \to \infty$ arms are pulled independently and uniformly at
random. Similarly to the Simulated Annealing optimization
method [12, 17], one can obtain empirical convergence
to the best arm by letting $\tau = \tau_t$ monotonically decrease to
0 according to some “cooling schedule”. A natural question
is then whether there exists a cooling schedule which provably
yields convergence to the best arm. We now introduce
a variant of BE, called SoftMix, for which we can construct
such an “optimal” cooling schedule. The algorithm
SoftMix (see Figure 1) uses the exponential distribution
mixed with the uniform distribution.¹ This is equivalent
to saying that, at each time $t$, we flip a biased coin to decide
whether the next arm to pull should be drawn from the
exponential distribution (with a prescribed finite temperature
value) or from the uniform distribution (corresponding
to the exponential distribution with infinite temperature).
We use $\gamma_t$ to denote the bias (which we also call the mixing
coefficient) of the coin. A crucial aspect is that both the
temperature parameter $\tau_t$ and the mixing coefficient $\gamma_t$
decrease with $t$ following a schedule chosen so as to minimize
the regret bound in our analysis. More precisely, we set
$\gamma_t = \Theta(\ln(t)/t)$ and $\tau_t = \Theta(1/\ln(t))$. For notational
convenience, $\tau_t$ is replaced by an “inverse temperature”
parameter $1/\eta_t$. The performance of SoftMix is analyzed in the
following result.

Theorem 3.1 For all integers $N > 1$ and for all $N$-armed
bandit problems with parameters $\mu_1, \ldots, \mu_N$, if $0 < d \le
\Delta(\mu_1, \ldots, \mu_N)$ then, for all $T \ge 1$, the regret after the
first $T$ pulls of the randomized allocation rule SoftMix
described in Figure 1 is at most

Recall that in the “zero temperature limit”, i.e. when
the temperature parameter $\tau$ approaches 0, BE becomes
greedy: at each time $t$, the arm $i$ maximizing the reward
estimate $\hat\mu_{i,t-1}$ gets pulled (ties are broken at random). The
obvious flaw in this strategy is that an early unlucky sampling
of some suboptimal arm might prevent the optimal
arm from being sampled enough. A more successful variant
of the greedy rule is the so-called $\varepsilon$-greedy heuristic
(see, e.g., [18]). At each time $t$, this strategy pulls with
probability $1 - \varepsilon$ any arm with the highest reward estimate
and pulls with probability $\varepsilon$ a randomly chosen arm. Now
note that the zero temperature limit of SoftMix (attained
when the inverse temperature parameter $\eta_t$ approaches
infinity) corresponds to the $\varepsilon$-greedy heuristic with the setting

¹The same mix was used in [2]. However, here the mixing
coefficient is dynamically adapted to minimize the regret uniformly
over time, whereas in [2] it was kept constant.
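Randomized allocation rule: SoftMix.

Input: Real number $0 < d < 1$.

Initialization: Define sequences $\gamma_t \in (0, 1]$ and $\eta_t > 0$,
$t = 1, 2, \ldots$, by

\[ \gamma_t = \begin{cases} \min\left\{ 1, \dfrac{5N}{d^2} \dfrac{\ln(t-1)}{t-1} \right\} & \text{if } t > 2, \\ 1 & \text{otherwise} \end{cases} \qquad (2) \]

and

\[ \eta_t = \frac{1}{N/\gamma_t + 1} \ln\left( 1 + \frac{d (N/\gamma_t + 1)}{2N/\gamma_t - d^2} \right) . \qquad (3) \]

Let $S_j = 0$ for $j = 1, \ldots, N$.

Loop: For each $t = 1, 2, \ldots$

• Pull an arm drawn from the distribution
$\{P_{1,t}, \ldots, P_{N,t}\}$, where

\[ P_{i,t} = (1 - \gamma_t) \frac{e^{\eta_t S_i}}{\sum_{j=1}^{N} e^{\eta_t S_j}} + \frac{\gamma_t}{N} . \qquad (4) \]

• Let $I_t$ be the index of the pulled arm and $X_{I_t,t}$ the
reward obtained. Add $X_{I_t,t} / P_{I_t,t}$ to $S_{I_t}$.


Figure 1: Description of the randomized allocation rule
SoftMix.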


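Randomized allocation rule: GreedyMix.

Input: Real number $0 < d < 1$.

Initialization: Define the sequence $\gamma_t \in (0, 1]$, $t =
1, 2, \ldots$, by

\[ \gamma_t = \begin{cases} \min\left\{ 1, \dfrac{5N}{d^2} \dfrac{\ln(t-1)}{t-1} \right\} & \text{if } t > 2, \\ 1 & \text{otherwise.} \end{cases} \]

Let $S_j = 0$ for $j = 1, \ldots, N$.

Loop: For each $t = 1, 2, \ldots$

• Let $\mathcal{I}$ be the subset of arms such that, for each $i \in
\mathcal{I}$, $S_i = \max_{1 \le j \le N} S_j$.

• With probability $1 - \gamma_t$ pull a random arm in $\mathcal{I}$,
with probability $\gamma_t$ pull a random arm.

• Let $I_t$ be the index of the pulled arm and $X_{I_t,t}$ the
reward obtained. Add $X_{I_t,t} / Q_{I_t,t}$ to $S_{I_t}$, where $Q_{j,t}$
is the probability of $I_t = j$ according to the rule
above, that is

\[ Q_{j,t} = \begin{cases} (1 - \gamma_t)/|\mathcal{I}| + \gamma_t/N & \text{if } j \in \mathcal{I}, \\ \gamma_t/N & \text{otherwise.} \end{cases} \qquad (5) \]


Figure 2: Description of the randomized allocation rule
GreedyMix.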


$\varepsilon = \gamma_t$. This observation suggests that the two allocation
rules might have similar behaviour, especially when $t$ is
large. The experiments of Section 6 (see Figure 4) confirm
this conclusion: SoftMix has a better start but, for $t$ large
enough, the two algorithms exhibit the same behavior. On
the other hand, we now state an upper bound on the regret
of the $\gamma_t$-greedy heuristic identical to the one we proved
for SoftMix. So, with respect to our analysis, the more
sophisticated selection method used by SoftMix does not
provide any extra benefit.
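
As an illustration of the two randomized rules described in Figures 1 and 2, the following is a minimal Python sketch, not the authors' code. It assumes the mixing coefficient is $\min\{1, (5N/d^2)\ln(t-1)/(t-1)\}$ for $t > 2$ and 1 otherwise, and it assumes Bernoulli arms purely for the demonstration. The function names (`gamma`, `eta`, `softmix_probs`, `greedymix_probs`, `run`) are hypothetical.

```python
import math
import random


def gamma(t, n_arms, d):
    """Mixing coefficient: min{1, 5N ln(t-1) / (d^2 (t-1))}, and 1 for t <= 2."""
    if t <= 2:
        return 1.0
    return min(1.0, 5.0 * n_arms * math.log(t - 1) / (d * d * (t - 1)))


def eta(t, n_arms, d):
    """Inverse-temperature parameter of SoftMix."""
    ratio = n_arms / gamma(t, n_arms, d)  # N / gamma_t
    c = ratio + 1.0
    return math.log(1.0 + d * c / (2.0 * ratio - d * d)) / c


def softmix_probs(scores, t, d):
    """Exponential weights on the scores S_i, mixed with the uniform distribution."""
    n = len(scores)
    g, e = gamma(t, n, d), eta(t, n, d)
    top = max(scores)  # subtracting the maximum avoids overflow
    w = [math.exp(e * (s - top)) for s in scores]
    z = sum(w)
    return [(1.0 - g) * wi / z + g / n for wi in w]


def greedymix_probs(scores, t, d):
    """Uniform over the current leaders with prob. 1 - gamma_t, else uniform."""
    n = len(scores)
    g = gamma(t, n, d)
    top = max(scores)
    leaders = [i for i, s in enumerate(scores) if s == top]
    return [((1.0 - g) / len(leaders) if i in leaders else 0.0) + g / n
            for i in range(n)]


def run(probs_fn, means, d, horizon, rng):
    """Play `horizon` pulls on Bernoulli arms (an assumption of this demo).

    Returns the number of times each arm was pulled.
    """
    n = len(means)
    scores = [0.0] * n
    pulls = [0] * n
    for t in range(1, horizon + 1):
        p = probs_fn(scores, t, d)
        arm = rng.choices(range(n), weights=p)[0]
        reward = 1.0 if rng.random() < means[arm] else 0.0
        scores[arm] += reward / p[arm]  # importance-weighted reward
        pulls[arm] += 1
    return pulls
```

For example, `run(softmix_probs, [0.2, 0.8, 0.5], 0.3, 10000, random.Random(0))` returns the per-arm pull counts of one simulated run. Note that, with these constants, both rules pull uniformly at random until $(5N/d^2)\ln(t-1)/(t-1)$ drops below 1.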

Theorem 3.2 For all integers $N > 1$ and for all $N$-armed
bandit problems with parameters $\mu_1, \ldots, \mu_N$, if $0 < d \le
\Delta(\mu_1, \ldots, \mu_N)$ then, for all $T \ge 1$, the regret after the first
$T$ pulls of the randomized allocation rule GreedyMix
described in Figure 2 is at most

Some heuristics for the bandit problem, like the so-called
“optimism in the face of uncertainty”, exhibit a two-phase
behaviour (see [11, Section 2.2.1] for a list of references).
In the first phase exploration is favored; in the second phase
exploitation takes over and the heuristic operates in a
completely greedy way. By extending the initial explorative
phase long enough one can make the risk of converging to a
suboptimal arm arbitrarily small (though not vanishing).


We propose a new strategy, called ROUNDS, where
purely explorative phases are alternated with purely
exploitative phases. To guarantee a good bound on the regret,
the length of the $r$-th exploitation phase is $2^r$ whereas the
length of the exploration phases grows only linearly. The
theoretical performance of ROUNDS (which is a deterministic
allocation rule) turns out to be comparable to that of the
randomized strategies SoftMix and GreedyMix. On the
other hand, our experiments indicate that both randomized
rules have an expected performance better than ROUNDS,
especially for small values of $T$.
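
The alternation of phases can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the exploration length per arm is taken to be $\lceil 2(\log_2(2N) + r)/d^2 \rceil$ as in Figure 3, the callback `pull` is a hypothetical stand-in for the environment, and stopping in the middle of a phase when the horizon is reached is a choice made here for convenience.

```python
import math


def rounds(pull, n_arms, d, horizon):
    """Alternate exploration and exploitation phases for `horizon` pulls.

    `pull(arm)` is a hypothetical callback returning a reward in [0, 1].
    Returns the list of arms pulled, in order.
    """
    history = []
    r = 0
    while len(history) < horizon:
        tau = math.ceil(2.0 * (math.log2(2 * n_arms) + r) / (d * d))
        totals = [0.0] * n_arms
        # Exploration: pull every arm tau times and record the total reward.
        for arm in range(n_arms):
            for _ in range(tau):
                if len(history) == horizon:
                    return history
                totals[arm] += pull(arm)
                history.append(arm)
        # Exploitation: pull the empirically best arm 2^r times.
        best = max(range(n_arms), key=lambda a: totals[a])
        for _ in range(2 ** r):
            if len(history) == horizon:
                return history
            pull(best)
            history.append(best)
        r += 1
    return history
```

The rule uses no internal randomness: given the same reward sequence it always produces the same sequence of pulls, and the reward estimates are recomputed from scratch in every round.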

Theorem 3.3 For all integers $N > 1$ and for all $N$-armed
bandit problems with parameters $\mu_1, \ldots, \mu_N$, if $0 < d \le
\Delta(\mu_1, \ldots, \mu_N)$ then, for all $T \ge 1$, the regret after the
first $T$ pulls of the deterministic allocation rule ROUNDS
described in Figure 3 is at most

^2D\ogi2FO  ^  ^  ^  ^riog(T  +  1)1^  • 

Our results of Section 3 hold under the assumption that a
lower bound $d > 0$ on the smallest difference $\mu_1 - \mu_j$,
$j \ne 1$, is known. Arbitrary values of $d$, however, still allow
us to prove reasonable bounds on the regret. In fact, the
regret bound is similar to the previous one, with an additional
$\Delta T$ term, where $\Delta$ is the difference between $\mu_1$ (the highest
expected reward) and the smallest $\mu_i$ strictly bigger than $\mu_1 - d$. (If
Allocation rule: ROUNDS.

Input: Real number $0 < d < 1$.

Loop: For each round $r = 0, 1, \ldots$

• Let $\tau_r = \lceil 2(\log_2(2N) + r)/d^2 \rceil$.

• For each arm $i = 1, \ldots, N$: Pull the arm $i$ for
$\tau_r$ times and let $S_{i,r}$ be the total reward obtained.

• Let $k$ be such that $S_{k,r} = \max_{1 \le j \le N} S_{j,r}$. Pull
arm $k$ for $2^r$ times.


Figure 3: Description of the deterministic allocation rule
ROUNDS.


$d$ is not larger than the smallest difference $\mu_1 - \mu_j$ ($j \ne 1$),
then $\Delta$ is 0 and we recover the previous bound.)


Corollary 3.4 For all integers $N > 1$, for all $N$-armed
bandit problems with parameters $\mu_1, \ldots, \mu_N$, and for all
$T \ge 1$, the regret of both randomized allocation rules
SoftMix and GreedyMix with input $d > 0$ is at most


where 


and 


A«r+^(81n^+31n=T) 


Aj(d)  =  max  Ai 


Did)  =  ^  A,-  . 


4  REMARKS 

Unbiased estimates. The randomized allocation rules
SoftMix and GreedyMix use a special kind of estimate
for the expected reward of each arm $i$. If $P_{i,s}$ is the
probability of pulling arm $i$ at time $s$, then the reward
estimate at time $t$ for arm $i$ is

\[ \frac{1}{t-1} S_{i,t-1} = \frac{1}{t-1} \sum_{s=1}^{t-1} \hat{X}_{i,s} \qquad (6) \]

where $\hat{X}_{i,s} = X_{i,s}/P_{i,s}$ if arm $i$ was pulled at time $s$ and
$\hat{X}_{i,s} = 0$ otherwise. This estimate, which was previously
used in [2] to solve a variant of the bandit problem substantially
different from the one studied here, has the correct
expectation $\mu_i$ for each arm $i$. In fact, we have

\[ E\left[ \hat{X}_{i,s} \right] = E\left[ \frac{X_{i,s}}{P_{i,s}} P_{i,s} + 0 \cdot (1 - P_{i,s}) \right] = \mu_i . \]

We could not prove our results for a different choice of the
reward estimates.
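
The unbiasedness claim can be checked on a small case. The sketch below is only an illustration: it assumes a Bernoulli reward (an assumption made here for simplicity) and computes the expectation of the reweighted reward exactly by enumerating the four outcomes. The function name `expected_estimate` is hypothetical.

```python
from fractions import Fraction


def expected_estimate(mu, p):
    """Exact expectation of the reweighted reward for one time step.

    The arm is pulled with probability p and pays 1 with probability mu
    (a Bernoulli reward, assumed here for illustration). The estimate is
    reward / p if the arm was pulled and 0 otherwise.
    """
    total = Fraction(0)
    for pulled, prob_pulled in ((True, p), (False, 1 - p)):
        for reward, prob_reward in ((1, mu), (0, 1 - mu)):
            x_hat = Fraction(reward) / p if pulled else Fraction(0)
            total += prob_pulled * prob_reward * x_hat
    return total
```

Whatever the pull probability `p`, the result equals `mu`: dividing by `p` exactly compensates for the arm being observed only a fraction `p` of the time.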


Cooling schedule. In order to compare the inverse temperature
parameter $\eta_t$ of SoftMix with the temperature
parameter $\tau_t$ of BE, in Section 3 we said that the sequence
of values $\eta_t$ for $t = 1, 2, \ldots$ corresponds to a cooling
schedule $\tau_t = \Theta(1/\ln t)$. To see why, recall that
the expression for the probability of drawing arm $i$ in BE
has $\hat\mu_{i,t-1}/\tau_t$ at the exponent, where $\hat\mu_{i,t-1}$ is the current
estimate of the expected reward for arm $i$. The corresponding
exponent for the probability of drawing arm $i$
in SoftMix is $S_{i,t-1} \eta_t$ (we are disregarding the contributions
of the factor $1 - \gamma_t$ and of the term $\gamma_t/N$, both negligible
for $t$ large). As, by (6), SoftMix’s reward estimate
is $S_{i,t-1}/(t-1)$, we get that $\hat\mu_{i,t-1}/\tau_t = \hat\mu_{i,t-1} (t-1) \eta_t$.
Hence $\tau_t = 1/((t-1)\eta_t)$. Asymptotically, the quantity
$(t-1)\eta_t$ shows a logarithmic growth.

Recall that the idea of BE with cooling is borrowed from
the Simulated Annealing (SA) optimization method. A
remarkable fact is that the cooling schedule necessary and
sufficient for convergence (with probability 1) of SA to the
global optimum is also $\Theta(1/\ln t)$, as shown in [7].
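
The logarithmic growth of $(t-1)\eta_t$ can be observed numerically. The sketch below is an illustration under the assumption that $\gamma_t = \min\{1, (5N/d^2)\ln(t-1)/(t-1)\}$ for $t > 2$ and that $\eta_t$ is given by (3); the function names are hypothetical. Under these assumptions a short calculation suggests that $(t-1)\eta_t / \ln(t-1)$ approaches $5\ln(1 + d/2)/d^2$, which is what the code compares against.

```python
import math


def gamma(t, n_arms, d):
    """Assumed mixing coefficient: min{1, 5N ln(t-1) / (d^2 (t-1))} for t > 2."""
    if t <= 2:
        return 1.0
    return min(1.0, 5.0 * n_arms * math.log(t - 1) / (d * d * (t - 1)))


def eta(t, n_arms, d):
    """Inverse-temperature parameter as in (3)."""
    ratio = n_arms / gamma(t, n_arms, d)
    c = ratio + 1.0
    return math.log(1.0 + d * c / (2.0 * ratio - d * d)) / c


def temperature(t, n_arms, d):
    """Equivalent BE temperature tau_t = 1 / ((t - 1) eta_t)."""
    return 1.0 / ((t - 1) * eta(t, n_arms, d))


def growth_ratio(t, n_arms, d):
    """(t - 1) eta_t divided by ln(t - 1); roughly constant for large t."""
    return (t - 1) * eta(t, n_arms, d) / math.log(t - 1)
```

Evaluating `growth_ratio` at increasing values of `t` shows the ratio settling near a constant, so `temperature(t, ...)` decays like a constant over $\ln t$.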


Instantaneous regret bounds. The proofs of Theorems
3.1 and 3.2 also yield bounds on the instantaneous
regret of both SoftMix and GreedyMix. In particular,
for all $t > \lceil (8N/d^2) \ln(5N/d^2) \rceil$,

\[ E\left[ X_{1,t} - X_{I_t,t} \right] \le \frac{D}{t-1} + \frac{5D \ln(t-1)}{d^2 (t-1)} \]

where $I_t$ is the arm pulled at time $t$ by either
SoftMix or GreedyMix.


Similarity of the regret bounds. The dominant term
in the regret bound for the three allocation rules considered
here is, recalling that $D = O(N)$, of the order of
$(N/d^2) \log^2 T$. This similarity is not by accident. In
Section 5 we show how the regret of both SoftMix and
GreedyMix can be reduced to the expectation of the
product of moment generating functions for certain random
variables (see (9) and (15)). This product is bounded term
by term using Taylor expansion of the exponential function.
For the deterministic rule ROUNDS, we control the accuracy
of the worst current reward estimate via standard Hoeffding
bounds. As Hoeffding bounds are again proven through
Taylor bounds on the moment generating function, we get
similar rates for the regret. Observe that, in both cases, the
Taylor expansion heavily relies on the boundedness of the
rewards.


Interval estimation method. Another popular allocation
rule, which works very well in empirical trials, is Kaelbling’s
interval estimation method [10]. This method operates
by computing upper bound estimates $U_{i,t}$ for the expected
reward $\mu_i$ of each arm $i$ satisfying

\[ P\{\mu_i > U_{i,t}\} = \theta \qquad (7) \]

for some parameter $\theta > 0$. The interval estimation rule
picks, at each time $t$, the arm $i$ maximizing $U_{i,t}$ for a fixed
value of $\theta$. Clearly, to compute $U_{i,t}$ satisfying (7) one
needs some information on the reward distributions. For
Bernoulli bandits (with rewards chosen in $\{0, 1\}$ for each
arm), one can use the Normal approximation to the binomial
distribution and then apply standard formulae to compute
the quantities $U_{i,t}$. For unknown reward distributions,
which is the case of our setup, one must resort to general
estimates in much the same way we used Hoeffding bounds
to control the regret of ROUNDS.

5  PROOFS 


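We will make use of the following fact, which can be easily
verified by Taylor expansion of the exponential function (a
proof can be found in [15, page 155]).


Fact 5.1 For every real $c > 0$, define the function $\phi_c$ on
the positive reals by $\phi_c(z) = (e^{cz} - 1 - cz)/c^2$. Then, for
every $y \le c$ and every $z \ge 0$, $e^{zy} \le 1 + zy + \phi_c(z) y^2$.


Proof of Theorem 3.1. Let $P_{j,t} = P(I_t = j \mid \mathcal{F}_{t-1})$
be defined as in (4). Recall that we are assuming $\mu_1 =
\max_{1 \le i \le N} \mu_i$. We rewrite the regret (1) as follows

\[ \begin{aligned}
\max_{1 \le j \le N} E\left[ \sum_{t=1}^{T} (X_{j,t} - X_{I_t,t}) \right]
&= E\left[ \sum_{t=1}^{T} (X_{1,t} - X_{I_t,t}) \right] \\
&= \sum_{t=1}^{T} E\left[ X_{1,t} - X_{I_t,t} \right] \\
&= \sum_{t=1}^{T} E\left[ E\left[ X_{1,t} - X_{I_t,t} \mid \mathcal{F}_{t-1} \right] \right] \\
&= \sum_{t=1}^{T} E\left[ \sum_{j=2}^{N} \Delta_j P_{j,t} \right] \\
&= \sum_{t=1}^{T} \sum_{j=2}^{N} \Delta_j E[P_{j,t}] . \qquad (8)
\end{aligned} \]

In (8) the inner conditional expectation is understood with
respect to both the random choice of $I_t$ and the random
realization of the reward $X_{I_t,t}$. The outer expectation simply
averages over the past $t - 1$ pulls and obtained rewards.
Define random variables

\[ \hat{X}_{j,t} = \begin{cases} X_{j,t}/P_{j,t} & \text{if } I_t = j, \\ 0 & \text{otherwise.} \end{cases} \]

Note that $\hat{X}_{j,t} \le 1/P_{j,t} \le N/\gamma_t$, a fact which we use
several times throughout this section. We have

\[ \begin{aligned}
E[P_{j,t}]
&= (1 - \gamma_t) E\left[ \frac{e^{\eta_t \sum_{s=1}^{t-1} \hat{X}_{j,s}}}{\sum_{i=1}^{N} e^{\eta_t \sum_{s=1}^{t-1} \hat{X}_{i,s}}} \right] + \frac{\gamma_t}{N} \\
&\le (1 - \gamma_t) E\left[ \frac{e^{\eta_t \sum_{s=1}^{t-1} \hat{X}_{j,s}}}{e^{\eta_t \sum_{s=1}^{t-1} \hat{X}_{1,s}}} \right] + \frac{\gamma_t}{N} \\
&= (1 - \gamma_t) E\left[ \prod_{s=1}^{t-1} e^{\eta_t (\hat{X}_{j,s} - \hat{X}_{1,s})} \right] + \frac{\gamma_t}{N} \\
&\le e^{-\eta_t (t-1) \Delta_j} E\left[ \prod_{s=1}^{t-1} e^{\eta_t Z_{j,s}} \right] + \frac{\gamma_t}{N} \qquad (9)
\end{aligned} \]

for $Z_{j,s} = \hat{X}_{j,s} - \hat{X}_{1,s} + \Delta_j$.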

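In the last step we multiplied and divided by the same quantity
$e^{\eta_t (t-1) \Delta_j}$ and then we dropped the factor $0 < 1 - \gamma_t \le 1$.
In view of bounding each factor $E[e^{\eta_t Z_{j,s}} \mid \mathcal{F}_{s-1}]$ via
Fact 5.1, first we compute the conditional expectation for
each $s$,

\[ E[Z_{j,s} \mid \mathcal{F}_{s-1}] = E[\hat{X}_{j,s} \mid \mathcal{F}_{s-1}] - E[\hat{X}_{1,s} \mid \mathcal{F}_{s-1}] + \Delta_j = \mu_j - \mu_1 + \Delta_j = 0 . \]

Second, observing that $\gamma_s$ is positive and nonincreasing in
$s$, we get $Z_{j,s} = \hat{X}_{j,s} - \hat{X}_{1,s} + \Delta_j \le N/\gamma_t + 1$. Third,
using the same observation, we also bound the conditional
variance for each $s$ as follows.

\[ \begin{aligned}
E[Z_{j,s}^2 \mid \mathcal{F}_{s-1}]
&= E[(\hat{X}_{j,s} - \hat{X}_{1,s})^2 \mid \mathcal{F}_{s-1}] + \Delta_j^2 + 2\Delta_j (\mu_j - \mu_1) \\
&= E[(\hat{X}_{j,s} - \hat{X}_{1,s})^2 \mid \mathcal{F}_{s-1}] - \Delta_j^2 \\
&= E[\hat{X}_{j,s}^2 \mid \mathcal{F}_{s-1}] + E[\hat{X}_{1,s}^2 \mid \mathcal{F}_{s-1}] - 2 E[\hat{X}_{j,s} \hat{X}_{1,s} \mid \mathcal{F}_{s-1}] - \Delta_j^2 \\
&= E[\hat{X}_{j,s}^2 \mid \mathcal{F}_{s-1}] + E[\hat{X}_{1,s}^2 \mid \mathcal{F}_{s-1}] - \Delta_j^2
\qquad \text{as } \hat{X}_{j,s} = 0 \text{ or } \hat{X}_{1,s} = 0 \\
&= E\left[ \frac{X_{j,s}^2}{P_{j,s}^2} P_{j,s} + 0 \cdot (1 - P_{j,s}) \,\Big|\, \mathcal{F}_{s-1} \right]
 + E\left[ \frac{X_{1,s}^2}{P_{1,s}^2} P_{1,s} + 0 \cdot (1 - P_{1,s}) \,\Big|\, \mathcal{F}_{s-1} \right] - \Delta_j^2 \\
&\le \frac{1}{P_{j,s}} + \frac{1}{P_{1,s}} - \Delta_j^2 \le \frac{2N}{\gamma_t} - \Delta_j^2
\end{aligned} \]

as $P_{i,s} \ge \gamma_t/N$ for $i = 1, \ldots, N$.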

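Hence, applying Fact 5.1 with $c = N/\gamma_t + 1$ and $z = \eta_t$
we find that

\[ \begin{aligned}
E\left[ e^{\eta_t Z_{j,s}} \mid \mathcal{F}_{s-1} \right]
&\le E\left[ 1 + \eta_t Z_{j,s} + Z_{j,s}^2 \phi_c(\eta_t) \mid \mathcal{F}_{s-1} \right] \\
&\le 1 + \left( \frac{2N}{\gamma_t} - \Delta_j^2 \right) \phi_c(\eta_t) \\
&\le \exp\left( \left( \frac{2N}{\gamma_t} - \Delta_j^2 \right) \phi_c(\eta_t) \right)
\end{aligned} \]

for all $s = 1, \ldots, t - 1$. Thus

\[ \begin{aligned}
E[P_{j,t}]
&\le e^{-\eta_t (t-1) \Delta_j} \prod_{s=1}^{t-1} \exp\left( \left( \frac{2N}{\gamma_t} - \Delta_j^2 \right) \phi_c(\eta_t) \right) + \frac{\gamma_t}{N} \\
&= \exp\left( -(t-1) \left( \eta_t \Delta_j - \phi_c(\eta_t) \frac{2N}{\gamma_t} + \phi_c(\eta_t) \Delta_j^2 \right) \right) + \frac{\gamma_t}{N} \\
&\le \exp\left( -(t-1) \left( \eta_t d - \phi_c(\eta_t) \frac{2N}{\gamma_t} + \phi_c(\eta_t) d^2 \right) \right) + \frac{\gamma_t}{N} \qquad (10)
\end{aligned} \]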

where (10) is shown using the assumption $d \le \Delta(\mu_1, \ldots, \mu_N)$.
By letting $(t-1) d = K$ and
$(t-1)(2N/\gamma_t - d^2) = \sigma^2$, the term at the exponent in (10)
becomes

\[ -K \eta_t + \phi_c(\eta_t) \sigma^2 . \qquad (11) \]

Rewriting $\eta_t$ in the simpler form $(1/c) \ln(1 + cK/\sigma^2)$,
replacing $\phi_c$ with its definition, and using the elementary
inequality $\ln(1 + x) \ge 2x/(2 + x)$, $x \ge 0$, we get that the
quantity in (11) is smaller than or equal to $-K^2/(2\sigma^2 + cK)$.
Hence, plugging back in the original expressions for $K$, $\sigma^2$,
and $c$, and simplifying the $t - 1$ factor, we find that (10) is
smaller than or equal to

\[ \exp\left( -\frac{(t-1) d^2}{4N/\gamma_t - 2d^2 + d(N/\gamma_t + 1)} \right) + \frac{\gamma_t}{N}
\le \exp\left( -\frac{(t-1) d^2 \gamma_t}{5N} \right) + \frac{\gamma_t}{N} . \qquad (12) \]

Now, for $t > T_0 = \lceil (8N/d^2) \ln(5N/d^2) \rceil$, we have that

\[ \gamma_t = \frac{5N}{d^2} \frac{\ln(t-1)}{t-1} . \]

The choice of $\gamma_t$ balances the contributions of the terms in
the right-hand side of (12). Thus, for all $t > T_0$ and all
$j \in \{2, \ldots, N\}$, we can further bound the right-hand side
of (12) as follows

\[ \exp\left( -\frac{(t-1) d^2 \gamma_t}{5N} \right) + \frac{\gamma_t}{N} = \frac{1}{t-1} + \frac{5}{d^2} \frac{\ln(t-1)}{t-1} . \]
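
The chain of bounds on the exponent, from (11) to the simplified form in (12), can be checked numerically. The following Python sketch is an illustration under the assumption that $\gamma_t = \min\{1, (5N/d^2)\ln(t-1)/(t-1)\}$ for $t > 2$ and 1 otherwise; the function name `exponent_bounds` and its return convention are hypothetical.

```python
import math


def exponent_bounds(t, n_arms, d):
    """Return (exact, bernstein, simplified) for a given t >= 2.

    exact      : the quantity -K eta_t + phi_c(eta_t) sigma^2 from (11)
    bernstein  : -K^2 / (2 sigma^2 + c K)
    simplified : -(t - 1) d^2 gamma_t / (5 N), the exponent kept in (12)
    """
    raw = 5.0 * n_arms * math.log(t - 1) / (d * d * (t - 1)) if t > 2 else 1.0
    g = min(1.0, raw)                              # assumed gamma_t
    c = n_arms / g + 1.0
    k = (t - 1) * d                                # K
    var = (t - 1) * (2.0 * n_arms / g - d * d)     # sigma^2
    eta = math.log(1.0 + c * k / var) / c
    phi = (math.exp(c * eta) - 1.0 - c * eta) / (c * c)
    exact = -k * eta + phi * var
    bernstein = -k * k / (2.0 * var + c * k)
    simplified = -(t - 1) * d * d * g / (5.0 * n_arms)
    return exact, bernstein, simplified
```

For every admissible input the three returned values should be nondecreasing from left to right, and once $\gamma_t < 1$ the exponential of the last one equals $1/(t-1)$, which is where the first term of the final bound comes from.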

For $t \le T_0$ the mixing coefficient $\gamma_t$ is 1. Hence, for each
such $t$, the regret is $\Delta_j$ with probability $1/N$ for each $j$.
Piecing everything together we obtain the desired bound


•  T 
N  /To  T  \

3=1  \t=l  t=To+l  / 


N  /To 


1  ^ 

=  E^  Ev+  E  EK.' 


j=l  \t=l  t=To+l 


.  A  .  /  8  5N  A  1 

-  E  M  d2  +  E  t_l 

]=1  \  t=To+l 

5  A  ln(f  -  1)  \ 
cP  ^  t-1 

f=To-|-l  / 

i'=2  ^  ' 


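Proof of Theorem 3.2. The regret (1) can be rewritten as
follows

\[ \begin{aligned}
\max_{1 \le j \le N} E\left[ \sum_{t=1}^{T} (X_{j,t} - X_{I_t,t}) \right]
&= E\left[ \sum_{t=1}^{T} (X_{1,t} - X_{I_t,t}) \right] \\
&= \sum_{t=1}^{T} E\left[ X_{1,t} - X_{I_t,t} \right] \\
&= \sum_{t=1}^{T} \sum_{j=2}^{N} \Delta_j P\{I_t = j\} \\
&= \sum_{j=2}^{N} \Delta_j \left( \sum_{t=1}^{T} P\{I_t = j\} \right) . \qquad (13)
\end{aligned} \]

We now bound $P\{I_t = j\}$ for each $j \in \{2, \ldots, N\}$. To
this end, define random variables

\[ \hat{X}_{j,t} = \begin{cases} X_{j,t}/Q_{j,t} & \text{if } I_t = j, \\ 0 & \text{otherwise} \end{cases} \]

where the probability $Q_{j,t} = P(I_t = j \mid \mathcal{F}_{t-1})$ is defined
in (5). Let $\mathcal{I}_t$ be the subset of $\{1, \ldots, N\}$ such that, for
each $i \in \mathcal{I}_t$,

\[ \sum_{s=1}^{t-1} \hat{X}_{i,s} = \max_{1 \le j \le N} \sum_{s=1}^{t-1} \hat{X}_{j,s} . \]

For any fixed $j$, we have

\[ \begin{aligned}
P\{I_t = j\}
&= P\{I_t = j \mid j \in \mathcal{I}_t\} P\{j \in \mathcal{I}_t\}
 + P\{I_t = j \mid j \notin \mathcal{I}_t\} P\{j \notin \mathcal{I}_t\} \qquad (14) \\
&\le P\{j \in \mathcal{I}_t\} + \frac{\gamma_t}{N} \\
&\le P\left\{ \sum_{s=1}^{t-1} (\hat{X}_{j,s} - \hat{X}_{1,s}) \ge 0 \right\} + \frac{\gamma_t}{N} \\
&= P\left\{ \eta_t \sum_{s=1}^{t-1} Z_{j,s} \ge \eta_t (t-1) \Delta_j \right\} + \frac{\gamma_t}{N}
 \qquad \text{for } Z_{j,s} = \hat{X}_{j,s} - \hat{X}_{1,s} + \Delta_j \text{ and } \eta_t > 0 \text{ arbitrary} \\
&= P\left\{ e^{-\eta_t (t-1) \Delta_j} \prod_{s=1}^{t-1} e^{\eta_t Z_{j,s}} \ge 1 \right\} + \frac{\gamma_t}{N} \\
&\le e^{-\eta_t (t-1) \Delta_j} E\left[ \prod_{s=1}^{t-1} e^{\eta_t Z_{j,s}} \right] + \frac{\gamma_t}{N} \qquad (15)
 \qquad \text{as } P\{X \ge 1\} \le E[X] \text{ for any positive r.v. } X .
\end{aligned} \]

The proof is concluded by noting that $\gamma_t$ is chosen as in (2),
$\eta_t$ can be set as in (3), and (15) is thus equal to (9) as in the
proof of Theorem 3.1. □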

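Proof of Theorem 3.3. Fix a positive integer $T$ and choose
any integer $r \ge 0$. Let $\hat\mu_{j,r} = S_{j,r}/\tau_r$, where $S_{j,r}$ is
the total reward for arm $j$ during round $r$. By hypothesis,
$E[X_{j,t} \mid \mathcal{F}_{t-1}] = \mu_j$ for all $j = 1, \ldots, N$ and all $t$.
Thus we can apply Hoeffding bounds [8] and obtain,² for
each fixed $j$ and for $\Delta = \Delta(\mu_1, \ldots, \mu_N)$,

\[ P\left\{ |\hat\mu_{j,r} - \mu_j| > \Delta/2 \right\} \le 2 e^{-\tau_r \Delta^2/2} \le \frac{1}{N 2^r} , \]

where the last inequality uses $\tau_r \ge 2(\log(2N) + r)/d^2$, $d \le \Delta$,
and $e^{-x} \le 2^{-x}$ for $x \ge 0$.
Hence, $P\{\exists j : |\hat\mu_{j,r} - \mu_j| > \Delta/2\} \le 2^{-r}$. Therefore, the
regret during round $r$ is at most

\[ \left( \sum_{j=1}^{N} \Delta_j \right) \tau_r + 2^r 2^{-r} = D \left\lceil 2(\log(2N) + r)/d^2 \right\rceil + 1 . \]

Let $\ell$ be the total number of rounds. That is, $\ell$ is the smallest
integer such that

\[ \sum_{r=0}^{\ell} (N \tau_r + 2^r) \ge T . \]

Clearly, $\ell \le \ell'$, where $\ell'$ is the smallest integer such that
$\sum_{r=0}^{\ell'} 2^r \ge T$. Thus $\ell + 1 \le \lceil \log(T + 1) \rceil$. Without loss
of generality, assume the last round ends exactly at time $T$
(if it ends before, then the bound gets better). We find that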

²Note that Hoeffding bounds can be applied, without
modification, also to the more general bandit model where the reward
distributions can adversarially change after each play provided the
reward expectations are kept fixed.
This concludes the proof. □
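
Two ingredients of the proof of Theorem 3.3 lend themselves to a quick numerical check: the Hoeffding failure probability per arm and the bound on the number of rounds. The sketch below is an illustration only. It assumes $\tau_r = \lceil 2(\log_2(2N) + r)/d^2 \rceil$ as in Figure 3 and evaluates the Hoeffding bound at deviation $d/2$, which is the worst case allowed by $d \le \Delta$. The function names are hypothetical.

```python
import math


def tau(r, n_arms, d):
    """Exploration pulls per arm in round r."""
    return math.ceil(2.0 * (math.log2(2 * n_arms) + r) / (d * d))


def failure_bound(r, n_arms, d):
    """Hoeffding bound 2 exp(-tau_r d^2 / 2) for a deviation of at least d / 2."""
    return 2.0 * math.exp(-tau(r, n_arms, d) * d * d / 2.0)


def last_round(horizon, n_arms, d):
    """Smallest l such that the rounds 0..l contain at least `horizon` pulls."""
    used, r = 0, 0
    while True:
        used += n_arms * tau(r, n_arms, d) + 2 ** r
        if used >= horizon:
            return r
        r += 1
```

With these helpers, `failure_bound(r, N, d)` should never exceed $1/(N 2^r)$, and `last_round(T, N, d) + 1` should never exceed $\lceil \log_2(T + 1) \rceil$.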


Proof  of  Corollary  3.4.  Following  (13),  the  regret  of 
GreedyMix  can  be  written  as 


max  E 

l<j<N 


=  5;A,('f;p  {/.=,)' 
j=2
Kt=\


<E  = 

t=l  {i  -. 

T  N 

+  E 
(  max  Av  I


+  Y 

=  At^,)T  +  Y  E  • 

‘=1 

The  proof  now  goes  along  the  same  lines  as  the  proof  of 
Theorem  3.1.  The  analysis  of  the  regret  of  SoftMix  is 
similar. 
Figure 4: The two graphs are averages over 1,000 runs of 100,000 pulls each. We divided each run into 1,000 blocks of 100
pulls each. At left we plot the fraction of times the true best arm was pulled in each block. At right we plot the average
reward per block divided by the highest expected reward for that run.


6  EXPERIMENTS 

We tested the three algorithms on a ten-armed bandit problem.
Due to the boundedness condition, the rewards were
drawn from beta distributions whose range is the unit interval
[0, 1]. In our experiments we averaged 1,000 runs. In
each run the two parameters of the beta distributions were
chosen uniformly and independently for each arm from the
real interval [2, 12]. The parameter $d$ was set to the best
possible choice $\Delta(\mu_1, \ldots, \mu_N)$. All other constants were
set as shown in Figures 1, 2, and 3. In the plots of Figure 4
we compare the performance of SoftMix, GreedyMix,
and ROUNDS on 100,000 pulls. Observe that SoftMix
and GreedyMix have a similar performance, although
SoftMix performs slightly better after a slower short initial
phase. ROUNDS has the slowest convergence, probably
due to the long initial domination of explorative phases.
Tests with up to 400,000 pulls show that the ranking of the
three algorithms stays the same, though ROUNDS improves
considerably in the long run as exploitation takes over
exploration. Note that our setting of the constants for the
mixing coefficient is independent of any property of the
reward distributions other than the parameter $d$. Hence, it
is conceivable (as we indeed observed in the experiments)
that more informed choices of the constants in $\gamma_t$ could lead
to a better empirical performance for specific reward
distributions. Finally, we ran tests for two other distributions
of the rewards: Bernoulli distributions (rewards in $\{0, 1\}$
with expectation of reward 1 chosen independently and
uniformly from [0, 1] for every arm) and uniform distributions
on $[0, a]$, where the parameter $a$ is chosen independently
and uniformly from [0, 1] for every arm. The empirical
results did not differ much from those obtained for the beta
distribution. We plan to carry out experiments in order to
test our algorithms against Boltzmann Exploration and
Interval Estimation.
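
The problem generation described above can be sketched in a few lines of Python. This is an illustrative reconstruction of the setup only, not the authors' experimental code: the function names `draw_problem` and `sample_reward` are hypothetical, and the use of the standard library generator is an assumption.

```python
import random


def draw_problem(n_arms, rng):
    """Draw one bandit instance with beta-distributed rewards.

    Both beta parameters of every arm are drawn uniformly from [2, 12].
    Returns the parameters, the expected rewards, and the gap between the
    best and the second-best expected reward (the best choice of d).
    """
    params = [(rng.uniform(2.0, 12.0), rng.uniform(2.0, 12.0))
              for _ in range(n_arms)]
    means = [a / (a + b) for a, b in params]  # mean of a Beta(a, b) variable
    ranked = sorted(means, reverse=True)
    gap = ranked[0] - ranked[1]
    return params, means, gap


def sample_reward(params, arm, rng):
    """Draw one reward in [0, 1] for the given arm."""
    a, b = params[arm]
    return rng.betavariate(a, b)
```

For instance, `draw_problem(10, random.Random(0))` produces one ten-armed instance; averaging over many such instances, with a fresh draw per run, mirrors the averaging over runs described in the text.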

7  CONCLUSIONS 

The main contribution of this paper is the derivation of
finite-time regret bounds for variants of widely used heuristics
for the bandit problem. Our results demonstrate that
the average reward per pull obtained by any one of these
variants converges to that of the best arm, and we show
explicit bounds on the convergence rate. We remark that,
rather than improving the empirical performance on specific
domains, our main interest is the understanding of the
nature of basic methods like Boltzmann Exploration, and
the derivation of rigorous regret bounds that are guaranteed
to hold in a vast range of situations.

Our work can be extended in many ways. A more
general version of the bandit problem is obtained by
removing the stationarity assumption on reward expectations
(see [4, 6] for extensions of the basic bandit problem).
For example, suppose that a stochastic reward process
$\{X_i(s) : s = 1, 2, \ldots\}$ is associated to each arm $i =
1, \ldots, N$. Here, pulling arm $i$ at time $t$ yields a reward
$X_i(s)$ and causes the current state $s$ of arm $i$ to change to
$s + 1$, whereas the states of the other arms remain frozen.
A well studied problem in this setup is the maximization of
the total expected reward in a sequence of $T$ pulls. There
are methods, like the Gittins allocation indices, that allow
one to find the optimal arm to pull at each time $t$ by considering
each reward process independently from the others (even
though the globally optimal solution depends on all the
processes).³ However, computation of the Gittins indices
requires preliminary knowledge about the reward processes.
To overcome this requirement, one can learn the Gittins
indices, as proposed in [5] for the case of finite-state Markovian
reward processes. However, there are no finite-time
regret bounds shown for this solution. At the moment, we
do not know whether our techniques could be extended to
these more general bandit problems.

Another open problem is whether the bounds we prove
are tight for each one of the three algorithms, and whether
they are optimal for the bandit problem considered here.
Abstract

Bayesian algorithms for Neural Networks are known to
produce classifiers which are very resistant to overfitting.
It is often claimed that one of the main distinctive
features of Bayesian Learning Algorithms is that
they don’t simply output one hypothesis, but rather
an entire distribution of probability over a hypothesis
set: the Bayes posterior. An alternative perspective is
that they output a linear combination of classifiers,
whose coefficients are given by Bayes theorem. One
of the concepts used to deal with thresholded convex
combinations is the ‘margin’ of the hyperplane with
respect to the training sample, which is correlated to
the predictive power of the hypothesis itself.

We provide a novel theoretical analysis of such classifiers,
based on Data-Dependent VC theory, proving
that they can be expected to be large margin hyperplanes
in a Hilbert space. We then present experimental
evidence that the predictions of our model are correct,
i.e. that Bayesian classifiers really find hypotheses
which have large margin on the training examples.
This not only explains the remarkable resistance to
overfitting exhibited by such classifiers, but also places
them in the same class as other systems, like
Support Vector Machines and AdaBoost, which have
similar performance.

Keywords: Bayesian Classifiers, Large margin hyperplanes,
Hilbert space


1  INTRODUCTION 

Bayesian learning algorithms for neural networks of
the kind described in [3] are often claimed to have the
distinctive feature of outputting an entire distribution
of probability over the hypothesis space, rather than
a single hypothesis. Such a distribution, the Bayes
posterior, depends on the training data and on the prior
distribution, and is used to make predictions by averaging
the predictions of all the elements of the set, in
a weighted majority voting scheme.

The  posterior  is  computed  according  to  Bayes’  rule, 
and  such  a  scheme  has  the  remarkable  property  that  - 
as  long  as  the  prior  is  correct  and  the  computations  can 
be  performed  exactly  -  its  expected  test  error  is  mini¬ 
mal.  Typically,  the  posterior  is  appoximated  by  com¬ 
bining  a  gaussian  prior  and  a  simplified  version  of  the 
likelihood  (the  data-dependent  term,  that  is  the  term 
that  reflects  the  information  gleaned  from  the  train¬ 
ing  set).  Such  a  distribution  is  then  sampled  with  a 
Montecarlo  method,  to  form  a  committee  whose  com¬ 
position  reflects  the  posterior  probability.  The  predic¬ 
tive  integral  over  a  posterior  distribution  can  hence  be 
replaced  by  a  sum. 

The  classifiers  obtained  with  this  method  are  known  to 
be  highly  resistent  to  overfitting.  Indeed,  neither  the 
committee  size  nor  the  network  size  strongly  affect  the 
performance,  to  such  an  extent  that  it  is  not  uncom¬ 
mon  -  in  the  bayesian  literature  -  to  find  computations 
with  “infinite  networks”  [4],  [10],  meaning  by  this  the 
posterior  over  the  complete  (infinite)  hypothesis  space. 

Statistical  Learning  Theory,  on  the  other  hand,  is  con¬ 
cerned  with  the  problem  of  bounding  the  test  error  (in 
the  worst  case  and  with  high  probability)  using  quan¬ 
tities  that  are  observable  in  the  training  set  or  known 
a  priori  [9]. 

The  expressions  obtained  for  such  a  bound  typically 
depend  on  the  training  error,  the  sample  size  and  the 
VC  dimension  of  the  classifier.  Given  that  the  number 
of  tunable  parameters  gives  a  rough  estimation  of  the 
VC  dimension,  the  size  of  the  network  and  that  of  the 
committee  do  matter. 

A more refined, Data-Dependent, version of the theory introduced in [8] shows
that it is possible to replace the VC dimension in the above-mentioned bounds
with a quantity which depends on the margin of the classifier on the training
examples.

In this paper we provide a novel description of Bayesian classifiers which
makes it possible to perform margin analysis on them, and hence to apply
Data-Dependent VC theory. In particular, by viewing the posterior
distribution as a linear functional in a Hilbert space, the margin can be
computed and gives a bound on the generalization error via an 'effective' VC
dimension which is much lower than the number of parameters.

Finally, an experimental study is performed with a standard Bayesian
algorithm [5] on real-world data, in order to test the predictions of our
model. The results of the experiments confirm that the model captures the
relevant features of these classifiers, and that they can indeed be regarded
as large margin hyperplanes in a Hilbert space.

Margin-distribution graphs are provided for different data sets, network
sizes, committee sizes and choices of prior, always showing the same
qualitative behaviour: a clear bias toward large margins on training
examples.

Our plots can be directly compared with the ones presented in the inspiring
paper by Schapire et al. [7], where this concept was introduced, as we have
used the same datasets. In that paper, a bound on the test error as a
function of the margin distribution was first obtained.

These theoretical and experimental results not only explain the remarkable
resistance to overfitting observed in Bayesian algorithms, but also provide a
surprising unified description of three of the most effective learning
algorithms: Support Vector Machines, Adaboost and now also Bayesian
classifiers.

2  BAYESIAN  LEARNING  THEORY 

The result of Bayesian learning is a probability distribution over the
(parametrized) hypothesis space, expressing the degree of belief in a
specific hypothesis as an approximation of the target function. This
distribution is then used to make predictions.

To start the process of Bayesian learning, one must define a prior
distribution $P(w)$ over the parameter space, possibly encoding some prior
knowledge. After observing the data, the prior distribution is updated using
Bayes' rule:

$$P(w|D) \propto P(D|w)\,P(w),$$

where $P(w|D)$ is the probability of the parameters given the data $D$,
$P(D|w)$ the probability of the data given the parameters, and $P(w)$ the
prior distribution over the parameters. The posterior distribution so
obtained hence encodes information coming from the training set (via the
likelihood function $P(D|w)$) and prior knowledge.

To predict the label of a new point, Bayesian classifiers integrate the
predictions made by every element of the hypothesis space, weighting them
with the posterior associated with each hypothesis, obtaining a probability
distribution over the set of possible labels (note that $h_w$ is the function
parametrised by $w$):

$$P(y|x,D) = \int_W h_w(x)\,P(w|D)\,dw$$

This predictive distribution can be used to minimize the number of
misclassifications in the test set; in the 2-class case this is achieved
simply by outputting the label which has received the highest vote.
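
The replacement of the predictive integral by a committee vote can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the committee members are hypothetical threshold classifiers standing in for networks sampled from the posterior, and the names `committee_vote`, `predict_label` and `make_threshold` are introduced here for the example only.

```python
from typing import Callable, Sequence

Hypothesis = Callable[[float], int]  # maps an input to a label in {-1, +1}


def committee_vote(committee: Sequence[Hypothesis], x: float) -> float:
    """Average the +/-1 votes of equally weighted posterior samples.

    With members drawn from the posterior, this average approximates the
    integral of h_w(x) P(w|D) dw and lies in [-1, +1].
    """
    return sum(h(x) for h in committee) / len(committee)


def predict_label(committee: Sequence[Hypothesis], x: float) -> int:
    """Two-class prediction: output the label with the larger vote."""
    return 1 if committee_vote(committee, x) >= 0 else -1


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical base classifier: +1 if x exceeds the threshold t."""
    return lambda x: 1 if x > t else -1


# Hypothetical committee: five thresholds standing in for sampled weights.
committee = [make_threshold(t) for t in (0.1, 0.2, 0.3, 0.4, 0.9)]
vote = committee_vote(committee, 0.5)   # four members vote +1, one votes -1
label = predict_label(committee, 0.5)
```

The sign of the averaged vote gives the predicted label, while its magnitude is the quantity that the margin analysis of the following sections is concerned with.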

3  BAYESIAN  CLASSIFIERS  AS  LARGE  MARGIN  HYPERPLANES

Hence,  the  actual  hypothesis  space  used  by  Bayesian 
systems  is  the  Convex  Hull  of  H,  rather  than  H.  The 
output  hypothesis  is  a  hyperplane,  whose  coordinates 
are  given  by  the  posterior. 

In order to study the margin of such hyperplanes, we will introduce some
simplifications in the general model. We assume that the base hypothesis
space $H$ is formed by Boolean valued functions, and that it is sufficiently
rich that all dichotomies can be implemented. Further, initially we will
assume that the average prior probability over functions in a particular
error shell (the set of hypotheses making the same number of errors on the
training sample) does not depend on the number of errors.

These are the only assumptions we make, and the second will be relaxed in a
second stage. A natural choice for the evidence function in a Boolean valued
hypothesis space is $e^{-\sigma k}$, where $k$ is the number of mistakes made
by the hypothesis and $\sigma > 0$ an appropriately chosen constant. This
expression has the required property of giving low likelihood to the
predictors which make many mistakes on the training set, and it is the form
to which the usual Bayesian evidence collapses in the Boolean case. Our
analysis will also suggest suitable choices for $\sigma$.

It can be interpreted with an assumption of Gaussian noise corrupting the
data after they have been labelled by a target function which belongs to $H$,
the variance of the noise depending on $1/\sigma$.

The assumption that all the dichotomies can be implemented with the same
probability corresponds to an 'uninformative' prior, where no knowledge is
available about the target function. In a second stage we will examine the
effect of inserting some knowledge in the prior, by slightly perturbing the
uninformative one towards the target hypothesis. We will see that even
slightly favourable priors can give a much smaller VC dimension than the
uninformative one.


3.1  THE  UNINFORMATIVE  PRIOR

As noted above, the hypothesis space actually used by Bayesian systems is the
convex hull of $H$, rather than $H$. The output hypothesis is a hyperplane,
whose coordinates are given by the posterior.

In this section we give an expression for the margin of the composite
hypothesis, as a function of a parameter related to our model of likelihood.
The result is obtained in the case of a uniform prior, and for the pattern
recognition case.

Let us start by stating some simple results and definitions which will be
useful in the following.

Definition 3.1 Let $B_i$ be the balance of the hypothesis $h_i$ over a given
sample of size $m$, that is the number of successes $s_i$ minus the number of
failures $f_i$: $B_i = s_i - f_i$, $m = s_i + f_i$.

Therefore $B_i = m - 2f_i$, which implies $B_i/m = 1 - 2\epsilon_i$, where
$\epsilon_i = f_i/m$ is the empirical error of $h_i$.

During the next proof we will need to know the probability in the prior
distribution of hypotheses in our parameter space with a fixed empirical
error. Given that this information is in general not available, we will
initially make the simplifying assumption that all behaviours on the training
sample can be realised. This implies that the hypothesis space has VC
dimension greater than or equal to the sample size $m$.

We make the further assumption that the prior probability of hypotheses which
have error $\epsilon = k/m$ is

$$\frac{1}{2^m}\binom{m}{k} = \frac{m!}{2^m\,(m\epsilon)!\,(m - m\epsilon)!},$$

in other words that the average prior probability for functions realising
different patterns of $k$ errors is $2^{-m}$. We will assume that the
posterior distribution for a hypothesis which has $k$ training errors is
proportional to $e^{-\sigma k} = C^k$, where $C = e^{-\sigma}$. We are now
ready to give the main result of this section.

Theorem 3.2 Under the above assumptions the margin of the Bayes Classifier is
given by

$$1 - \frac{2C}{1 + C}.$$

Proof: Let the set of training examples be $S = (x_1, \ldots, x_m)$ with
classifications $y = (y_1, \ldots, y_m) \in \{-1, 1\}^m$. Let the margin of
example $i$ be $M_i$. Consider first the average margin

$$\langle M \rangle = \frac{1}{m}\sum_{i \in S} M_i
  = \frac{1}{m}\sum_{i \in S} y_i \int_H a_h h(x_i)\,dP(h)
  = \frac{1}{m}\sum_{i \in S} y_i \sum_{j \in J} a_j P_j h_j(x_i),$$

where $h_j$, $j \in J$ are representatives of each possible classification of
the sample. We are denoting by $P_j$ the prior probability of classifiers
agreeing with $h_j$. The quantity $a_j P_j$ is the posterior probability of
these classifiers, where the coefficient $a_j = A\,C^{m\epsilon_j}$ is the
evidence, which depends only on the empirical error and the normalising
constant $A$. By assumption, we have

$$\sum_{j \in k\ \text{error shell}} P_j = \frac{1}{2^m}\binom{m}{k}.$$

Hence,

$$\langle M \rangle = \sum_{j \in J} a_j P_j \frac{B_j}{m}
  = \sum_{j \in J} a_j P_j (1 - 2\epsilon_j)
  = 1 - 2\sum_{j \in J} a_j P_j \epsilon_j \qquad (1)$$

by the observation concerning the balance $B_j$ of $h_j$ and the fact that
the posterior distribution has been normalised, that is
$1 = \int_H a_h\,dP(h) = \sum_{j \in J} a_j P_j$. We now regroup the elements
of the sum on the right hand side of the above equation by decomposing the
hypothesis space into error shells. Hence, we can write the above sum as

$$\sum_{j \in J} a_j P_j \epsilon_j
  = \frac{A}{2^m}\sum_{k=0}^{m}\binom{m}{k} C^k \frac{k}{m}. \qquad (2)$$

Solving for $A$ and substituting, gives

$$\sum_{j \in J} a_j P_j \epsilon_j
  = \frac{\sum_{k}\binom{m}{k} C^k \frac{k}{m}}{\sum_{k}\binom{m}{k} C^k}.$$

We can now use the equality $\sum_k \binom{m}{k} C^k = (1 + C)^m$, and the
observation that $\sum_k C^k \binom{m}{k}\frac{k}{m}$ can be written as
$C \sum_k \binom{m-1}{k-1} C^{k-1} = C(1 + C)^{m-1}$, to obtain the result
for the average margin.

To complete the proof we must show that the average margin is in fact the
minimal margin. We will demonstrate this by showing that the margin of all
points is equal. Intuitively, this follows from the symmetry of the
situation, there being nothing to distinguish between different training
points in the structure of the hypothesis. The formal proof relies on
performing a permutation on the training points, but has had to be omitted in
this shortened version. ■
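
The statement of Theorem 3.2, including the claim that all training points receive the same margin, can be checked by brute force for a small sample. The sketch below is an illustration only: it assumes the setting of the theorem (every dichotomy of the sample is realised, with uniform prior and evidence $C^k$), and the function name `bayes_margins` and the chosen labels are hypothetical.

```python
from itertools import product


def bayes_margins(y, C):
    """Margins y_i * f(x_i) of the posterior-weighted vote over all dichotomies.

    Every labelling h in {-1,+1}^m gets the same prior 2^-m (which cancels in
    the normalisation) and evidence C^k, where k is the number of
    disagreements between h and the true labels y.
    """
    m = len(y)
    weights, labellings = [], []
    for h in product((-1, 1), repeat=m):
        k = sum(hi != yi for hi, yi in zip(h, y))   # training errors of h
        weights.append(C ** k)                      # unnormalised posterior
        labellings.append(h)
    total = sum(weights)                            # normalising constant
    return [
        y[i] * sum(w * h[i] for w, h in zip(weights, labellings)) / total
        for i in range(m)
    ]


y = (1, -1, -1, 1, 1)          # arbitrary labels for m = 5 training points
C = 0.5
margins = bayes_margins(y, C)
predicted = 1 - 2 * C / (1 + C)   # closed form of Theorem 3.2
```

For these values every entry of `margins` coincides with `predicted` up to rounding, which is the symmetry argument of the proof in numerical form.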


There are three relevant bounds on the generalization error in terms of the
margin on the training set. We will quote all three here and then discuss
their applicability in the current context. The first two appear in Schapire
et al. [7].

Following [7], let $H$ denote the space from which the base hypotheses are
chosen (for example Neural Networks, or Decision Trees). A base hypothesis
$h \in H$ is a mapping from an instance space $X$ to $\{-1, +1\}$. In the
next two theorems $\mathcal{C}$ denotes the convex hull of $H$ (not to be
confused with the constant $C = e^{-\sigma}$ used above).

Theorem 3.3 Let $S$ be a sample of $m$ examples chosen independently at
random according to $D$. Assume that the base hypothesis space $H$ has VC
dimension $d$, and let $\delta > 0$. Then, with probability at least
$1 - \delta$ over the random choice of the training set $S$, every weighted
average function $f \in \mathcal{C}$ satisfies the following bound for all
$\theta > 0$:

$$P_D[yf(x) \le 0] \le P_S[yf(x) \le \theta]
  + O\left(\frac{1}{\sqrt{m}}\left(\frac{d\log^2(m/d)}{\theta^2}
  + \log(1/\delta)\right)^{1/2}\right)$$

Theorem 3.4 Let $S$ be a sample of $m$ examples chosen independently at
random according to $D$. Assume that the base hypothesis space $H$ is finite,
and let $\delta > 0$. Then, with probability at least $1 - \delta$ over the
random choice of the training set $S$, every weighted average function
$f \in \mathcal{C}$ satisfies the following bound for all $\theta > 0$:

$$P_D[yf(x) \le 0] \le P_S[yf(x) \le \theta]
  + O\left(\frac{1}{\sqrt{m}}\left(\frac{\log m \log|H|}{\theta^2}
  + \log(1/\delta)\right)^{1/2}\right)$$

As observed by the authors, the theorem applies to every majority vote
method, including boosting, bagging, ECOC, etc.

The third is contained in Shawe-Taylor et al. [8] and involves the fat
shattering dimension of the space of functions.

Theorem 3.5 Consider a real valued function class $\mathcal{F}$ having fat
shattering function bounded above by the function
$\mathrm{afat} : \mathbb{R} \to \mathbb{N}$ which is continuous from the
right. Fix $\theta \in \mathbb{R}$. If a learner correctly classifies $m$
independently generated examples $z$ with
$h = T_\theta(f) \in T_\theta(\mathcal{F})$ such that $\mathrm{er}_z(h) = 0$
and $\gamma = \min |f(x_i) - \theta|$, then with confidence $1 - \delta$ the
expected error of $h$ is bounded from above by

$$\epsilon(m, k, \delta) = \frac{2}{m}\left(k \log\frac{8em}{k}\,\log(32m)
  + \log\frac{8m}{\delta}\right),$$

where $k = \mathrm{afat}(\gamma/8)$.

Since the assumption that the underlying hypothesis space can perform any
classification of the training set implies that its VC dimension is at least
$m$, we cannot expect that learning is possible in the situation described.
Indeed, we have augmented the power of the hypothesis space by taking our
functions from the convex hull of $H$, which would appear to make the
situation yet worse.

Hence, in order to obtain useful applications of any of the theorems we will
need to consider deviations from the most general situation described above.
The deviation should not have a significant impact on the margin, while
reducing the expressive power of the hypotheses.

In order to apply Theorem 3.4 the number of hypotheses in the base class $H$
must be finite. The logarithm of the number of hypotheses appears in the
result. Since we have assumed that all possible classifications of the
training set can be performed, the number of hypotheses must be at least
$2^m$, making the bound uninteresting. To apply this theorem we must assume
that a very large proportion of the hypotheses have zero weight in the prior,
while those that have significant weights in the posterior (i.e. have low
empirical error) are retained. Making this assumption the bound will become
significant. However, we are interested in capturing the effect of
non-discrete priors, that is situations where potentially all of the base
hypotheses are included, but those with high empirical error have lower prior
probability.

In order to apply Theorem 3.3 the underlying hypothesis class $H$ must be
assumed to have low VC dimension in such a way that no significant impact is
made on the margin. This could be achieved by removing high error functions.
Note that the functions would have to be removed, in other words given prior
probability 0. Hence, the bound obtained would be no better than a standard
VC bound in the original space. A situation where this approach and analysis
might be advantageous is where the consistent hypothesis $h_y$ is not
included in $H$. This will reduce the margin by approximately
$a_{h_y} 2^{-m} = (1 + C)^{-m}$, since $B_{h_y} = m$ (see equation (1)). The
approximation arises from not adjusting the normalisation to take account of
the missing hypothesis and is thus a very small error.

These  applications  are  unable  to  take  into  account  the 
prior  distribution  in  a  flexible  way.  In  the  next  section 
we  will  present  an  application  of  the  third  approach  to 
show  how  this  can  take  advantage  of  a  beneficial  prior. 

3.2  THE  EFFECT  OF  THE  PRIOR  DISTRIBUTION  ON  THE  MARGIN  BOUND

We will consider the situation where the prior decays geometrically with the
error shells. In other words the prior on hypotheses with $k$ errors is
multiplied by $a^k$ for some $a < 1$. We first repeat the calculations of
Theorem 3.2 for this case. The sum (2) must take into account that in this
case

$$\sum_{j \in k\ \text{error shell}} P_j = a^k (1 + a)^{-m}\binom{m}{k}.$$

The factor $(1 + a)^m$ cancels and the factor $a$ appears wherever $C$
appears, that is

$$\sum_{j \in J} a_j P_j \epsilon_j
  = \frac{\sum_{k}\binom{m}{k} (aC)^k \frac{k}{m}}{(1 + aC)^m}
  = \frac{aC}{1 + aC}.$$

Hence, the margin can be computed as

$$1 - \frac{2aC}{1 + aC}.$$
We now quote a theorem due to Gurvits [2] that bounds the fat shattering
dimension of linear functionals in Banach spaces, which we will need to bound
the effective VC dimension.

Theorem 3.6 [2] Consider a Banach space $B$ of type $p$ and the class of
linear functions $L$ of norm less than or equal to one restricted to the unit
sphere. Then there is a constant $D$ such that
$\mathrm{fat}_L(\gamma) \le D\gamma^{-p/(p-1)}$.

Note that for Hilbert spaces, which we will consider, the value of $p = 2$.

In order to apply Theorems 3.5 and 3.6 we need to bound the radius of the
sphere containing the points and the norm of the linear functionals involved.
Clearly, scaling by these quantities will give the margin appropriate for
application of the theorem. The Hilbert space we consider is that given by
the input space $X$ with inner product

$$\langle x, y \rangle = \int_H h(x)h(y)\,dP(h).$$

Hence, the norm of input points is 1 and they are contained in the unit
sphere as required. The linear functionals considered are those determined by
the posterior distribution. The norm is given by

$$\|a\|^2 = \int_H a_h^2\,dP(h).$$


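We must compute this value for the posterior functional in the prior
described above. The integral in this case is given by

$$\int_H a_h^2\,dP(h) = \sum_{j \in J} a_j^2 P_j
  = \frac{(1 + a)^m}{(1 + aC)^{2m}}\sum_{k=0}^{m}\binom{m}{k} a^k C^{2k}
  = \frac{(1 + a)^m (1 + aC^2)^m}{(1 + aC)^{2m}}.$$

Hence, dividing the squared norm by the squared margin, the bound on the fat
shattering dimension becomes (up to the constant $D$ of Theorem 3.6)

$$g(a, C) = \frac{(1 + a)^m (1 + aC^2)^m}{(1 + aC)^{2m-2}(1 - aC)^2}.$$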

In the rest of this section we will consider how this function behaves for
various choices of $C$ and $a$, showing that for careful choices of $C$,
values of $a$ close to 1 can give dimensions significantly lower than $m$,
and hence give good bounds on the generalization error. The analysis shows
that using this approach it is possible to make use of a beneficial prior. At
the same time it suggests a value of $C$ most likely to take advantage of
such a prior.

First consider the case when $a = 1$. Hence,

$$g(1, C) = \frac{2^m (1 + C^2)^m}{(1 + C)^{2m-2}(1 - C)^2}.$$

The parameter $C$ can be chosen in the range $[0, 1)$. However,
$g(1, C) \to \infty$ as $C \to 1$, while $g(1, 0) = 2^m$. Clearly, the
optimal choice of $C$ needs to be determined if the bound is to be useful. A
routine calculation establishes that the value of $C$ which minimises the
expression is $C_0 = (m - 2\sqrt{m - 1})/(m - 2)$, which gives a value of

$$g(1, C_0) = m\left(\frac{m}{m - 1}\right)^{m-1} \approx em.$$

This confirms that the effective VC dimension is not increased excessively
provided $C$ is chosen around $1 - 2/\sqrt{m}$. In order to study the effect
of allowing $a$ to move slightly below 1, we will perform a Taylor expansion
about $a = 1$.
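
The optimal choice of $C$ for $a = 1$ discussed above can be checked numerically. The sketch below is an illustration under stated assumptions: `stationary_C` implements the stationary point we obtain by setting the derivative of $\log g(1, C)$ to zero, the sample size $m = 100$ is arbitrary, and all function names are introduced here for the example only.

```python
import math


def log_g(a: float, C: float, m: int) -> float:
    """Logarithm of g(a, C) = (1+a)^m (1+a C^2)^m / ((1+aC)^(2m-2) (1-aC)^2)."""
    return (m * math.log(1 + a) + m * math.log(1 + a * C * C)
            - (2 * m - 2) * math.log(1 + a * C) - 2 * math.log(1 - a * C))


def g(a: float, C: float, m: int) -> float:
    """Bound on the fat shattering dimension, evaluated through logarithms."""
    return math.exp(log_g(a, C, m))


def stationary_C(m: int) -> float:
    """Stationary point of g(1, C) in [0, 1), valid for m > 2."""
    return (m - 2 * math.sqrt(m - 1)) / (m - 2)


m = 100
C0 = stationary_C(m)
g_min = g(1.0, C0, m)

# Coarse grid search over C in [0, 1) as an independent check of the minimiser.
grid = [i / 10000 for i in range(10000)]
C_grid = min(grid, key=lambda C: log_g(1.0, C, m))
```

For $m = 100$ the grid search and the closed form agree to within the grid spacing, and `g_min` lies slightly below $em$, far below the value $2^m$ obtained at $C = 0$.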

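Let $C' = aC$ and define the function

$$g_1(a, C') := g(a, C'/a)
  = \frac{(1 + a)^m (1 + C'^2/a)^m}{(1 + C')^{2m-2}(1 - C')^2}.$$

Note that $\left.\frac{\partial g_1(a, C')}{\partial C'}\right|_{a=1} = 0$ at
$C' = C_0$, and so, since

$$\frac{dg(a, C_0)}{da} = \frac{\partial g_1(a, C')}{\partial a}
  + \frac{\partial g_1(a, C')}{\partial C'}\frac{dC'}{da},$$

we have

$$\left.\frac{dg(a, C_0)}{da}\right|_{a=1}
  = \left.\frac{\partial g_1(a, C')}{\partial a}\right|_{a=1}.$$

Differentiating gives

$$\left.\frac{\partial g_1(a, C')}{\partial a}\right|_{a=1}
  = \frac{m\,2^{m-1}(1 + C'^2)^{m-1}}{(1 + C')^{2m-3}(1 - C')}.$$

We can now perform a Taylor series expansion of $g(a, C_0)$ about $a = 1$ to
obtain $g(a, C_0) \approx em(1 + (a - 1)\sqrt{m - 1})$, where we have omitted
some routine calculations. Hence, the bound on the generalization error is
(ignoring log factors) $O(1 - (1 - a)\sqrt{m - 1})$, so that to obtain
generalization error of order $\epsilon$, we need

$$a \approx 1 - \frac{1 - \epsilon}{\sqrt{m - 1}}.$$

Hence, for values of $a$ very close to 1, the prior can result in very good
generalization properties.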
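
The first-order behaviour of $g(a, C_0)$ near $a = 1$ can be checked with a finite difference. This is a sketch under stated assumptions: $C_0$ is taken to be the stationary point of $g(1, C)$ obtained by setting its derivative to zero, $m = 100$ is an arbitrary sample size, and the helper names are hypothetical. The expansion is only first order, so `required_a` should be read as indicative rather than exact.

```python
import math


def g(a: float, C: float, m: int) -> float:
    """g(a, C) evaluated through logarithms for numerical stability."""
    log_value = (m * math.log(1 + a) + m * math.log(1 + a * C * C)
                 - (2 * m - 2) * math.log(1 + a * C) - 2 * math.log(1 - a * C))
    return math.exp(log_value)


def linearised_g(a: float, m: int) -> float:
    """First-order expansion of g(a, C_0) about a = 1, using g(1, C_0) ~ e*m."""
    return math.e * m * (1 + (a - 1) * math.sqrt(m - 1))


def required_a(eps: float, m: int) -> float:
    """Value of a suggested by the expansion for a target error of order eps."""
    return 1 - (1 - eps) / math.sqrt(m - 1)


m = 100
C0 = (m - 2 * math.sqrt(m - 1)) / (m - 2)    # stationary point of g(1, C)
h = 1e-5
slope = (g(1 + h, C0, m) - g(1 - h, C0, m)) / (2 * h)   # central difference in a
predicted_slope = g(1.0, C0, m) * math.sqrt(m - 1)      # slope used in the expansion
```

The numerical slope matches $g(1, C_0)\sqrt{m - 1}$, which is the coefficient behind the factor $(a - 1)\sqrt{m - 1}$ in the expansion.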


4  EXPERIMENTS 

In this section we will look at some experiments where we calculated margin
distributions for two data sets. We used the vehicle data and the satimage
data, both taken from the StatLog* database. These datasets were used by [7]
for a comparison of the margin distributions of Bagging and Boosting. We used
satimage as provided: there are 4435 samples in the training set and 2000 in
the test set. The vehicle data were merged; 500 samples were used for
training and 252 for testing.

*The data are available via the UCI machine learning repository at
http://www.ics.uci.edu/~mlearn/MLRepository.html.


4.1  EXPERIMENTAL  SETUP 

Both datasets are polychotomous classification problems. To arrive at a
reasonable posterior probability density over weight space we need, besides a
prior, a proper data model and likelihood term.

According to [1], the best thing we can do in the case of polychotomous
classification is to use (3), the generalized logistic or softmax
transformation of the output layer activations. Given distributions of hidden
unit activations which are members of the exponential family, this
transformation guarantees that the network outputs may be interpreted as
probabilities for classes.

$$y_k = \frac{\exp(a_k)}{\sum_{k'} \exp(a_{k'})} \qquad (3)$$

In (3) the value $a_k$ is the value at output node $k$ before applying
softmax activation.
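
As a minimal sketch of the softmax transformation (3), the following Python function maps a list of output activations to class probabilities. The function name and the example activations are hypothetical; subtracting the largest activation before exponentiating is a standard numerical precaution that leaves the result of (3) unchanged.

```python
import math
from typing import List, Sequence


def softmax(activations: Sequence[float]) -> List[float]:
    """Map output-layer activations a_k to class probabilities as in (3)."""
    shift = max(activations)                 # subtracting the max avoids overflow
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical activations of a three-class output layer.
probs = softmax([2.0, 1.0, 0.0])
```

The outputs are positive and sum to one, so they can be read as class probabilities, and the ordering of the activations is preserved.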

Having sampled a sufficient number of weights we are ready to predict. In a
Bayesian framework each input value leads to a predictive distribution of
network outputs. In the case of classifications, the network output is simply
given by integrating over the predictive distribution. Having sampled from
the posterior over weights, in our case the expectation is approximated by a
sum over the weights.
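
The averaging over sampled weights, and the margin computed from it, can be sketched as follows. This is an illustration under assumptions that are not stated in the text: the margin of a training example is taken, following the multi-class definition used in [7], as the committee weight of the correct class minus the largest weight given to any wrong class, and all names and numbers below are hypothetical.

```python
from typing import List, Sequence


def committee_output(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors of the sampled networks."""
    n = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n for c in range(n_classes)]


def margin(avg_probs: Sequence[float], true_class: int) -> float:
    """Assumed margin: correct-class weight minus the largest wrong-class weight."""
    wrong = max(p for c, p in enumerate(avg_probs) if c != true_class)
    return avg_probs[true_class] - wrong


def margin_cdf(margins: Sequence[float], theta: float) -> float:
    """Fraction of training examples whose margin is at most theta."""
    return sum(mg <= theta for mg in margins) / len(margins)


# Hypothetical softmax outputs of three sampled networks on one input.
members = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
avg = committee_output(members)
m_correct = margin(avg, true_class=0)   # example is classified correctly
m_wrong = margin(avg, true_class=1)     # same output, but true class is 1
```

Evaluating `margin_cdf` over a range of thresholds gives a cumulative margin distribution of the kind plotted in margin-distribution graphs; a negative margin corresponds to a misclassified training example.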

The experiments were performed for both datasets with different settings.
Initially we sampled 600 weights using the standard method without ARD priors
(Automatic Relevance Determination [3]). The network size was fixed to 25
hidden units for both datasets. This experiment was used to investigate the
dependence of the margin distribution on the number of weights used to
represent the posterior. Discarding 50 initial weights, we calculated the
margin distribution of a committee consisting of the next 150 weights and
compared it to the margin distribution when using all 550 remaining weights.

To assess whether the margin distribution changes while increasing the size
of the network, we performed two further experiments sampling 150 weights for
a network with 50 and 200 hidden units respectively, again using conventional
priors without ARD. A fourth experiment was designed to reveal the influence
of an ARD prior on the margin distribution. We sampled 150 weights for a
network with 25 hidden units using an ARD prior on the input-to-hidden layer
weights. Figure 1 shows plots of the resulting margin distributions for the
vehicle dataset. The margin distributions for the satimage data are shown in
Figure 2. Looking at the plots of the margin distributions, we see that they
are different. It is interesting to investigate whether these differences are
significant and whether the differences in the margin distributions are
correlated with the performance of the classifier on an independent test set.
From theory we expect that a classifier which shows larger margins on the
training data should also show a better generalization error.

For both experiments with the 200 hidden unit networks we see a trend towards
lower margins. This fact can be understood when remembering that the prior
variance of the hidden-to-output weights scales inversely with the number of
hidden units. Increasing the number of hidden units forces smaller
hidden-to-output weights, which leads to a smaller complexity of the network
and therefore to underfitting and increased errors on the training set.

4.2  RESULTS 

In  order  to  compare  the  margin  distribution  with  the 
generalization  error,  we  used  each  classifier  to  predict 
class  labels  on  an  independent  test  set.  The  different 
experimental  setups  and  the  resulting  generalization 
errors  are  summarized  in  table  1. 

Table 1: Network size, information about prior distribution, committee size,
and generalization error for satimage (sat) and vehicle (veh) data.

Hidden units   ARD   Prior          Committee size   Error (sat)   Error (veh)
25             no    Γ(0.05, 0.5)   150              9.2%          15.5%
25             no    Γ(0.05, 0.5)   550              8.9%          14.7%
50             no    Γ(0.05, 0.5)   150              8.6%          13.5%
200            no    Γ(0.05, 0.5)   150              7.7%          24.2%
25             yes   Γ(0.05, 0.5)   150              9.7%          17.5%

In order to test our hypothesis that a better performance on the test set is
indicated by larger margins on the training data, we will use the first
experiment as reference and compare its margin distribution with the margin
distributions of the second to fifth experiments. Four one-sided t-tests were
used to assess whether the observed differences of means are significant.
Assuming independent individual experiments, this approach suffers from the
fact that the risk of having incorrectly rejected one of the hypotheses is as
large as the sum of the individual significance levels. In this case this
poses no problem because each difference was highly significant. In Table 2
we show the generalization error and the means of the margin distributions.

Table 2: Generalization error and margin distributions

Satimage data             Vehicle data
Error    Mean margin      Error    Mean margin
9.2%     0.929            15.5%    0.73
8.9%     0.932            14.7%    0.72
8.6%     0.926            13.5%    0.78
7.7%     0.898            24.2%    0.45
9.7%     0.895            17.5%    0.70

We expect that larger mean values of the margin distribution correspond to
smaller generalization errors. Looking at the satimage experiments, we see
that this is true for the large committee experiment and for the ARD-prior
experiment when compared to the first experiment. For the vehicle data we see
the expected correlation for both large network scenarios and for the
ARD-prior experiment, again comparing with the results of the first
experiment.
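
The comparison with the first experiment can be reproduced directly from the values of Table 2. The sketch below only checks the direction of the changes (larger mean margin together with smaller test error, or the reverse); it does not repeat the t-tests, and the function name `agrees_with_reference` is introduced here for illustration.

```python
# Rows of Table 2, in experiment order: (test error in %, mean margin).
satimage = [(9.2, 0.929), (8.9, 0.932), (8.6, 0.926), (7.7, 0.898), (9.7, 0.895)]
vehicle = [(15.5, 0.73), (14.7, 0.72), (13.5, 0.78), (24.2, 0.45), (17.5, 0.70)]


def agrees_with_reference(rows):
    """For experiments 2..5, report whether margin and error moved oppositely.

    True means: larger mean margin than experiment 1 and smaller test error,
    or smaller mean margin and larger test error.
    """
    ref_error, ref_margin = rows[0]
    return [
        (mg - ref_margin) * (err - ref_error) < 0
        for err, mg in rows[1:]
    ]


sat_agreement = agrees_with_reference(satimage)   # experiments 2, 3, 4, 5
veh_agreement = agrees_with_reference(vehicle)
```

On the satimage data the expected direction is found for the second (large committee) and fifth (ARD prior) experiments, and on the vehicle data for the third, fourth and fifth experiments, in line with the discussion above.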

5  CONCLUSIONS 

Our  theoretical  analysis  and  experimental  results  show 
that  Bayesian  Classifiers  of  the  kind  described  in  [3] 
can  be  regarded  as  large  margin  hyperplanes  in  a 
Hilbert  space,  and  consequently  can  be  analysed  with 
the  tools  of  Data-Dependent  VC  theory. 

The  non-linear  mapping  from  the  input  space  to  the 
Hilbert  space  is  given  by  the  initial  choice  of  network 
architecture,  while  the  coordinates  of  the  hyperplane 
are  given  by  the  Bayes’  posterior  and  hence  depend 
both  on  the  training  data  and  on  the  chosen  prior. 

The choice of the prior turns out to be a crucial one, since we have shown
how even slightly correctly guessed priors can be translated into a much
lower VC dimension of the resulting classifier (and this, coupled with high
training accuracy, ensures good generalization). But even with a totally
uninformative prior there is at least no harm in using these apparently
overcomplex systems.

Experiments performed on real world data confirm the predictions of the
model, highlighting a strong bias toward large margins in all experimental
conditions and with different data sets. Their correlation with test error
has also been studied.

The practical utility of VC bounds, however, does not lie in quantitative
predictions of the test error (the price for their universality is often a
certain looseness), but rather in providing an analytical expression of the
test error which can be used to study the role of the different parameters
and design choices on the final performance. Also, via the SRM principle,
such bounds provide a theoretically sound indicator of performance. The
results obtained in this work can be incorporated in actual learning systems,
to provide for example an independent stopping criterion: the VC bound on the
error could be calculated during the learning, and the training could be
stopped when no significant increase in performance is observed. Also, the
other choices, like net size, committee size and type of prior, could be made
using their effect on the margin as a guideline.

On the theoretical side, the surprising result of this paper is to co-locate
Bayesian Classifiers in the same category as other systems, namely Support
Vector Machines and Adaboost, which were motivated by very different
considerations but which exhibit very similar behaviours (e.g. with respect
to overfitting).

A unified analysis of the three systems is now possible, which can make
potentially fruitful comparisons or cross-fertilizations much easier.
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Margin  distribution:  150  natwortcs,  25  hidden  units,  no  ARD 


Figure 1: Plot of the margin distribution for the vehicle
data. The different experimental setups lead to different
margin distributions. Further investigations show
that these differences are highly significant. Using the
first experiment as reference, the third to fifth margin
distributions indicate the correct trend in the generalization
error for the third to fifth classifiers respectively,
whereas the conclusion we would draw from the second
margin distribution is misleading.


Margin  distribution:  150  networks,  25  hidden  units,  no  ARD 


Figure 2: Plot of the margin distribution for the satimage
data. In this case, too, we get different margin
distributions. Again using the first experiment as reference,
the margin distributions allow us to predict the correct
trend of the generalization performance for the second
and fifth experiments. The conclusion suggested by the
third and fourth margin distributions, which indicate
worse generalization performance than the first
experiment, is again misleading.
Abstract

This paper presents a new approach to hierarchical
reinforcement learning based on the MAXQ decomposition
of the value function. The MAXQ decomposition has both
a procedural semantics (as a subroutine hierarchy) and a
declarative semantics (as a representation of the value
function of a hierarchical policy). MAXQ unifies and
extends previous work on hierarchical reinforcement
learning by Singh, Kaelbling, and Dayan and Hinton.
Conditions under which the MAXQ decomposition can
represent the optimal value function are derived. The
paper defines a hierarchical Q learning algorithm, proves
its convergence, and shows experimentally that it can
learn much faster than ordinary “flat” Q learning.
Finally, the paper discusses some interesting issues that
arise in hierarchical reinforcement learning, including
the hierarchical credit assignment problem and
non-hierarchical execution of the MAXQ hierarchy.

1  Introduction 

Hierarchical approaches to reinforcement learning (RL)
problems promise many benefits: (a) improved exploration
(because exploration can take “big steps” at high levels of
abstraction), (b) learning from fewer trials (because fewer
parameters must be learned and because subtasks can
ignore irrelevant features of the full state), and (c) faster
learning for new problems (because subtasks learned on
previous problems can be re-used).

Recent research has explored three general approaches to
reaching these goals. The first approach, introduced by
Dean and Lin (1995), exploits a hierarchical decomposition
primarily as a computational device to accelerate the
computation of the optimal policy. The second approach,
introduced by Parr and Russell (1998), relies on a programmer
to design a hierarchy of abstract machines that constrains
the possible policies to be considered. Their method
computes the policy that is optimal subject to these
hierarchical constraints by effectively flattening the hierarchy.
We will call this kind of policy hierarchically optimal,
because it is the best policy consistent with the imposed
hierarchy. The third approach, pioneered by Singh (1992),
Kaelbling (1993), and Dayan and Hinton (1993), also relies
on a programmer-designed hierarchy. In this hierarchy,
each subtask is defined in terms of goal states or termination
conditions. Each subtask in the hierarchy corresponds
to its own Markov Decision Problem (MDP), and the methods
seek to compute a policy that is locally optimal for each
subtask. We will call such policies recursively optimal.
Recent work by Precup, Sutton, and Singh (1998) studies
aspects of both the first and third approaches.

In this paper, we extend the research on recursively optimal
policies by introducing the MAXQ method for hierarchical
reinforcement learning. The methods introduced
by Singh, Kaelbling, and Dayan and Hinton are all specific
to particular tasks. The Feudal Q learning method
of Dayan and Hinton suffers from the problem that at all
non-primitive levels of a Feudal-Q hierarchy, the learning
task can become non-Markovian, and therefore difficult to
solve. In contrast, the MAXQ method is general purpose.
At each level of the hierarchy, the task is Markovian and
can be solved by standard RL methods. In many cases,
state abstractions can be introduced without destroying the
optimality of the learned policy. Like Kaelbling’s work,
MAXQ supports non-hierarchical execution of the learned
policy, which permits it to behave well even when the
optimal policy violates the structure of the hierarchy.

This paper is organized as follows. First, we introduce the
MAXQ hierarchy using an example and define its procedural
and declarative semantics. Then we introduce two theorems
that describe the conditions under which the MAXQ
hierarchy can successfully represent the value function of
a fixed hierarchical policy. Section 4 introduces a learning
algorithm for training a MAXQ hierarchy and shows
experimentally and theoretically that it works well. Finally,
the paper shows how a non-hierarchical policy can be
computed and executed using the MAXQ hierarchy.

Figure 1: The Taxi Domain

2  The  MAXQ  Hierarchy 

We will introduce the MAXQ method using the simple Taxi
Problem shown in Figure 1. A taxi inhabits a 5-by-5 grid
world. There are four specially-designated locations in this
world, marked as R(ed), B(lue), G(reen), and Y(ellow).
The taxi problem is episodic. In each episode, the taxi starts
in a randomly-chosen state and with a randomly-chosen
amount of fuel (ranging from 5 to 12 units). There is a
passenger at one of the four locations (chosen randomly),
and that passenger wishes to be transported to one of the
four locations (also chosen randomly). The taxi must go to
the passenger’s location (the “source”), pick up the passenger,
go to the destination location (the “destination”), and
put down the passenger there. (To keep things uniform, the
taxi must pick up and drop off the passenger even if he/she
is already located at the destination!) The episode ends
when the passenger is deposited at the destination location.

There are seven primitive actions in this domain: (a) four
navigation actions that move the taxi one square North,
South, East, or West (each of these consumes one unit of
fuel), (b) a Pickup action, (c) a Putdown action, and (d) a
Fillup action (which can only be executed when the taxi is
at location F(uel)). Each action is deterministic. There is
a reward of -1 for each action and an additional reward of
+20 for successfully delivering the passenger. There is a
reward of -10 if the taxi attempts to execute the Putdown
or Pickup actions illegally. If a navigation action would
cause the taxi to hit a wall, the action is a no-op, and there
is only the usual reward of -1. Finally, the episode also
ends (with a reward of -20) if the fuel level falls below
zero.

We seek a policy that maximizes the average reward per
step. In this domain, this is equivalent to maximizing the
total reward per episode. The optimal policy, which is
non-trivial to implement by hand, attains an average
reward per step of 0.92 (computed over 5000 trials). There
are 8,750 possible states: 25 squares, 5 locations for the
passenger (counting the four starting locations and the
taxi), 5 destinations, and 14 fuel levels.
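
The state count follows directly from the four factors listed above. A quick check (the variable names are ours):

```python
# Factors of the Taxi state space, as listed in the text.
squares = 25              # 5-by-5 grid
passenger_locations = 5   # four landmark locations, or inside the taxi
destinations = 5
fuel_levels = 14

num_states = squares * passenger_locations * destinations * fuel_levels
print(num_states)  # 8750

# A flat Q table needs one entry per (state, primitive action) pair.
num_primitive_actions = 7
flat_q_entries = num_states * num_primitive_actions
print(flat_q_entries)  # 61250
```

The second number is the size of the flat Q representation against which the hierarchy is compared in the experiments.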

This task has a simple hierarchical structure in which there
are three sub-tasks: Get the passenger, Refuel the taxi, and
Deliver the passenger. Each subtask involves navigating
to one of the five locations and then performing a Pickup,
Fillup, or Putdown action. While the taxi is navigating to
a location, only that location is relevant. We would like to
capture this hierarchical structure and take advantage of it
during learning and performance.

Figure 2 shows a MAXQ graph for this problem. This
graph contains two kinds of nodes: Max nodes (indicated
by triangles) and Q nodes (indicated by ovals). Max nodes
with no children denote primitive actions in the domain;
Max nodes with children represent subtasks. In this simple
problem, there are five such subtasks: (a) Navigate(t)
(move the taxi to target location t), (b) Get (move to the
passenger’s location and pick up the passenger), (c) Put
(move to the passenger’s destination and put down the
passenger), (d) Refuel (move to F and Fillup), and (e) Root
(perform the overall task of picking up and delivering the
passenger). Notice that the Navigate task is shared by the
Get, Put, and Refuel tasks.

The immediate children of each Max node are Q nodes.
Each Q node represents an action that can be performed
to achieve its parent’s subtask. For example, the MaxGet
node has a child QNavigateForGet which represents the
action of navigating from the current state to the passenger’s
location. The distinction between Max nodes and Q
nodes is critical to ensuring that subtasks can be shared and
reused. Each Max node will learn the context-independent
expected cumulative reward of performing its subtask. For
example, MaxNavigate(t) will estimate the expected
cumulative reward of navigating from any state to one of the
five target locations t. Each Q node will learn the
context-dependent expected cumulative reward of performing
its subtask. For example, QNavigateForGet(t) will learn
the expected cumulative reward of navigating to location
t and then completing the Get task. On the other hand,
QNavigateForPut(t) will learn the expected cumulative
reward of navigating to location t and then completing the
Put task. Both of these Q nodes will “ask” MaxNavigate(t)
how much it will cost to get to location t, and they will use
this to help them compute their Q values. The value function
computed by MaxNavigate is context independent and
can be shared by all three of its parent Q nodes.

In the rest of the paper, we will say that Max node a is the
child of Max node i if there is a Q node whose parent is i
and whose child is a.
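
With this convention, the Taxi MAXQ graph can be written down as a plain mapping from each composite Max node to its child Max nodes, with one Q node per edge. The sketch below is our own encoding of the structure described in the text (the parameter t of Navigate is omitted for brevity, and the helper names are hypothetical):

```python
# Composite Max node -> child Max nodes. Each edge stands for one Q node.
MAXQ_CHILDREN = {
    "Root": ["Get", "Put", "Refuel"],
    "Get": ["Pickup", "Navigate"],
    "Put": ["Putdown", "Navigate"],
    "Refuel": ["Fillup", "Navigate"],
    "Navigate": ["North", "South", "East", "West"],
}


def is_primitive(node):
    """Max nodes without children are primitive actions."""
    return node not in MAXQ_CHILDREN


def q_nodes(children):
    """One (parent, child) pair per Q node in the graph."""
    return [(parent, child) for parent, kids in children.items() for child in kids]


def parents_of(node, children):
    """Max nodes that can invoke `node` through one of their Q nodes."""
    return [parent for parent, kids in children.items() if node in kids]


print(len(q_nodes(MAXQ_CHILDREN)))            # 13
print(parents_of("Navigate", MAXQ_CHILDREN))  # ['Get', 'Put', 'Refuel']
print(is_primitive("Pickup"))                 # True
```

The three parents of Navigate correspond to QNavigateForGet, QNavigateForPut, and QNavigateForRefuel, which is what allows the Navigate value function to be shared.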

To define the semantics of the MAXQ graph more formally,
let us suppose that the overall task is to solve a Markov
Decision Problem (MDP) $M$ defined over a set of states $S$
and actions $A$ with reward function $R(s' \mid s, a)$ (the reward
received upon entering state $s'$ after performing action $a$
in state $s$) and transition probability function $P(s' \mid s, a)$ (the
probability of entering state $s'$ as a result of performing $a$ in
$s$). In this paper, we will assume that the MDP $M$ defines an
undiscounted stochastic shortest path problem. All of the
results can be extended to the infinite-horizon discounted
case.

Each Max node $i$ corresponds to a separate subtask $M_i$. The
children of Max node $i$ are the actions of $M_i$. Each subtask
$M_i$ divides the set $S$ of all states into two disjoint subsets:
$S_i$ and $T_i$. The set $T_i$ is the set of terminal states for $M_i$.
Subtask $M_i$ will terminate whenever the environment enters
one of the states in $T_i$. A subset $G_i \subseteq T_i$ of the terminal
states are the goal states of $M_i$. Below, we will discuss the
details of defining a reward function that will encourage
$M_i$ to terminate in one of these goal states. Let us define
$\pi_i$ to be some (arbitrary) policy for subtask $i$. This policy
“attempts” to get from any state in $S_i$ to one of the goal
states in $G_i$.

A hierarchical policy for a MAXQ graph is a set of policies
$\pi = \{\pi_0, \ldots, \pi_n\}$, one for each Max node, that indicate
how each Max node should choose its actions. The hierarchical
policy is executed the same way that subroutines are
executed in ordinary programming languages. The Root
policy chooses one of its child actions to perform, say, Get.
The Get policy then chooses one of its child actions, say,
Pickup. Then the Pickup action is executed, since it is a
primitive. A Max node’s policy is executed until that Max
node enters a terminating state, at which point “control”
returns to its parent Max node.
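
The subroutine-style execution described above can be sketched as a short recursive procedure. The function and the toy corridor task below are our own illustration (hypothetical names throughout), not code from the paper: a Max node keeps asking its policy for a child until the node's termination test holds, and primitive children act on the environment directly.

```python
def execute(node, state, children, policies, is_terminal, env_step):
    """Run Max node `node` from `state`; return (final_state, primitive_trace)."""
    if node not in children:                      # primitive action
        return env_step(state, node), [node]
    trace = []
    while not is_terminal(node, state):
        child = policies[node](state)             # the node's policy picks a child
        state, sub_trace = execute(child, state, children, policies,
                                   is_terminal, env_step)
        trace.extend(sub_trace)
    return state, trace                           # control returns to the parent


# Toy task (hypothetical): walk right along a corridor to cell 2, then grab an item.
children = {"Root": ["Walk", "Grab"], "Walk": ["Right"]}
policies = {
    "Root": lambda s: "Walk" if s[0] < 2 else "Grab",
    "Walk": lambda s: "Right",
}


def is_terminal(node, state):
    pos, has_item = state
    return has_item if node == "Root" else pos >= 2


def env_step(state, action):
    pos, has_item = state
    if action == "Right":
        return (pos + 1, has_item)
    return (pos, has_item or pos == 2)            # "Grab" only works in cell 2


final_state, trace = execute("Root", (0, False), children, policies,
                             is_terminal, env_step)
print(trace)        # ['Right', 'Right', 'Grab']
print(final_state)  # (2, True)
```

Because each node re-tests its termination condition after every child returns, a child may be invoked several times before its parent completes.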

Therefore, we can view the MAXQ graph as a subroutine
call graph. Like subroutines, Max nodes can be parameterized.
In this graph, MaxNavigate takes one parameter,
t, which specifies which of the five locations (R, B, G, Y,
F) is the target of the MaxNavigate. One way in which the
graph is different from an ordinary program is that the
children of each Max node are unordered. They can be called
in any order, and a Max node can execute each of its
children multiple times before it completes its subtask. The
MAXQ graph is therefore a kind of incompletely-specified
non-deterministic program. One result of learning will be
to determine a policy for each Max node that tells how and
when to invoke its children. This will make the MAXQ
graph a completely-specified deterministic program
(interacting with a non-deterministic environment).

Thus far, our formulation of the MAXQ method is essentially
the same as the Feudal Q learning method of Dayan
and Hinton (1993). However, an important improvement
over Feudal Q learning is the ability to interpret the MAXQ
graph as a representation of the value function for a
hierarchical policy. Consider Max node $i$, and define $V_i^\pi(s)$ to be
the expected cumulative reward for following the hierarchical
policy $\pi$ starting in state $s$ until we enter some state in
$T_i$. For a fixed hierarchical policy $\pi$, subtask $M_i$ has a
well-defined transition probability function $P_i^\pi(s' \mid s, a)$, which is
the probability that the environment will move from state
$s$ to state $s'$ when $M_i$ executes action $a$. This probability
is well defined, because the child $M_a$ is executing a fixed
policy $\pi_a$ (as are all of its descendants). Hence, node $i$ can
treat action $a$ as an atomic action. The immediate reward
for node $i$ of executing $a$ will be the expected reward for
node $a$ of moving from the current state $s$ to a terminal state
in $T_a$ according to policy $\pi_a$. This is denoted $V_a^\pi(s)$. Hence,
we can write

$$V_i^\pi(s) = V_a^\pi(s) + \sum_{s'} P_i^\pi(s' \mid s, a)\, V_i^\pi(s'), \tag{1}$$

where $a = \pi_i(s)$. This gives us a recursive decomposition
of the value function so that the value function of the root
node is the value function of the entire MDP $M$ and each
subtask $M_i$ is a separate MDP.

This recursive expression becomes more useful when we
switch to the action-value (or “Q”) representation of the
value function. Define $Q_i^\pi(s, a)$ to be the expected cumulative
reward for MDP $M_i$ of performing action $a$ in state $s$
and then following the hierarchical policy $\pi$ thereafter.
Define the second term on the right-hand side of Eq. (1) to be
$C_i^\pi(s, a)$, which we will call the completion function. This
is the expected cumulative reward of completing MDP $M_i$
following policy $\pi$ after executing action $a$ in state $s$. With
these definitions, we can rewrite Eq. (1) as

$$Q_i^\pi(s, a) = V_a^\pi(s) + C_i^\pi(s, a) \tag{2}$$

where

$$V_i^\pi(s) = \begin{cases} Q_i^\pi(s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & \text{if } i \text{ is primitive} \end{cases} \tag{3}$$

$$C_i^\pi(s, a) = \sum_{s'} P_i^\pi(s' \mid s, a)\, V_i^\pi(s') \tag{4}$$

These completely define the value-function semantics of
the MAXQ hierarchy. Each Q node with parent $i$ and child
$a$ stores the information $C_i^\pi(s, a)$ for each state $s$ in $S_i$. Each
Max node $i$ returns the Q value of the child chosen by $\pi_i$.

To compute the value of a hierarchical policy $\pi$ in state $s$,
we begin at MaxRoot (node 0) and compute $Q_0^\pi(s, \pi_0(s))$.
This requires that we ask our child node $a_1 = \pi_0(s)$ for
its value $V_{a_1}^\pi(s)$. Our child recursively asks its child
$a_2 = \pi_{a_1}(s)$ for its value, and so on until a leaf node $a_n$ is reached.
Let $(a_1, a_2, \ldots, a_n)$ be the path that was traversed through
the MAXQ graph. Now leaf node $a_n$ returns $V_{a_n}^\pi(s)$, to
which its parent adds $C_{a_{n-1}}^\pi(s, a_n)$, and so on recursively.
The value returned by MaxRoot is

$$V_0^\pi(s) = V_{a_n}^\pi(s) + C_{a_{n-1}}^\pi(s, a_n) + \cdots + C_{a_1}^\pi(s, a_2) + C_0^\pi(s, a_1). \tag{5}$$

Figure 3 shows how the sequence of rewards $r_1, r_2, \ldots$
received from the primitive actions is decomposed
hierarchically into the sum of the $C$ terms.
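
The recursive evaluation behind Eq. (5) is easy to state in code. The following is a minimal sketch, assuming the completion values and leaf values are already stored in dictionaries. The table layout, the function name, and the numbers are ours and purely illustrative:

```python
def value(node, s, policy, C, V_leaf, children):
    """Value of Max node `node` in state s under a fixed hierarchical policy."""
    if node not in children:                      # primitive leaf: stored value
        return V_leaf[(node, s)]
    a = policy[(node, s)]                         # child chosen by the node's policy
    # Value of the child's subtask plus the stored completion value.
    return value(a, s, policy, C, V_leaf, children) + C[(node, s, a)]


# A single path Root -> Get -> Navigate -> North with made-up table entries.
children = {"Root": ["Get"], "Get": ["Navigate"], "Navigate": ["North"]}
policy = {("Root", "s"): "Get", ("Get", "s"): "Navigate", ("Navigate", "s"): "North"}
V_leaf = {("North", "s"): -1.0}
C = {
    ("Navigate", "s", "North"): -3.0,
    ("Get", "s", "Navigate"): -1.0,
    ("Root", "s", "Get"): -8.0,
}

print(value("Root", "s", policy, C, V_leaf, children))  # -13.0
```

The result is the leaf value plus one completion term per level of the path, exactly the sum that the root returns.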

3  Representation  Theorems 

Under what conditions can this hierarchy represent the
value function of a fixed, hierarchical policy? We will say
that a MAXQ graph is a full-state graph if separate $C_i^\pi(s, a)$
values are stored for each state $s \in S_i$. In most applications,
including the Taxi task of Figure 1, it will be desirable to
introduce an abstraction function $\chi_i(s)$ that will provide a set
of features that abstract essential information from the state.
Each Q node will then store the function $C_i^\pi(\chi_i(s), a)$, with one
value for each distinct abstract state $\chi_i(s)$.
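
In a tabular implementation, an abstraction function simply determines the key under which a completion value is stored. The sketch below mirrors the abstraction the paper reports using for the navigation Q nodes (taxi location and target location only); the function name, the dictionary-based state, and the stored value are our own assumptions.

```python
from collections import defaultdict


def chi_navigate(state, target):
    """Abstract state for QNorth/QSouth/QEast/QWest: taxi square and target only."""
    return (state["taxi"], target)


C_north = defaultdict(float)   # completion values for QNorth, keyed by abstract state

# Two full states that differ only in features the abstraction ignores.
s1 = {"taxi": (2, 3), "passenger": "R", "destination": "G", "fuel": 7}
s2 = {"taxi": (2, 3), "passenger": "Y", "destination": "B", "fuel": 11}

C_north[chi_navigate(s1, "F")] = -4.0          # made-up value, stored once
shared = C_north[chi_navigate(s2, "F")]        # s2 reads the same table entry
print(shared)  # -4.0

full_states = 25 * 5 * 5 * 14                  # one entry per full state
abstract_states = 25 * 5                       # 25 taxi squares x 5 targets
print(full_states, abstract_states)  # 8750 125
```

Sharing entries in this way is what shrinks the tables, but it is only sound when the merged states really do have equal completion values.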

For full-state graphs, it is easy to prove the following
theorem by expanding Equations (2-4):

Theorem 1 Let $\pi = \{\pi_i;\ i = 0, \ldots, n\}$ be a hierarchical
policy defined over a full-state MAXQ graph and let $i = 0$ be
the root node of the graph. Then there exist values for $C_i^\pi(s, a)$
(for internal Max nodes) and $V_i^\pi(s)$ (for primitive, leaf Max
nodes) such that $V_0^\pi(s)$ is the expected cumulative reward of
following policy $\pi$ in state $s$.

A  more  important  and  difficult  question  is  to  understand 
the  conditions  under  which  an  abstract-state  MAXQ  graph 
can  exactly  represent  the  value  function  of  a  hierarchical 
policy.  The  following  theorem  establishes  one  condition: 


$$Q_i^\pi(s, a) = V_a^\pi(s) + C_i^\pi(s, a)$$

Figure 3: The MAXQ decomposition; $r_1, \ldots, r_{14}$ denote the sequence of rewards received from primitive actions at times $1, \ldots, 14$.


Theorem 2 For all Max nodes $i$ and actions $a$, let
$\mathrm{Result}_i^\pi(s, a) = \{s' \mid P_i^\pi(s' \mid s, a) > 0\}$ be the set of states that
can result from applying abstract action $a$ in state $s$ at node
$i$ while following hierarchical policy $\pi$. If the following
condition holds, then the MAXQ graph with abstraction
functions $\chi_i(s, a)$ can represent the value function of any
policy $\pi$ whose value function can be represented by the
MAXQ graph with no abstraction functions:

For all Max nodes $i$, actions $a$, states $s \in S_i$,
and distinct states $s_1, s_2 \in \mathrm{Result}_i^\pi(s, a)$, whenever
$C_i^\pi(s_1, a) \neq C_i^\pi(s_2, a)$ it is the case that
$\chi_i(s_1, a) \neq \chi_i(s_2, a)$.

In other words, if an abstraction function $\chi_i$ treats a pair of
result states $s_1$ and $s_2$ as identical, then their un-abstracted
values must be equal. Otherwise, the value function cannot
be properly represented. The four children of MaxNavigate
all satisfy this condition. The expected reward of completing
the MaxNavigate action depends only on the current
location of the taxi, the target location, and the amount of
fuel remaining. If we are navigating to F (for refueling),
for example, the expected reward does not depend on the
source or destination locations.

The introduction of abstractions can create a hierarchical
credit assignment problem. For example, in our
implementation, we used only the taxi location and the target
location to represent the $C$ functions for QNorth, QSouth,
QEast, and QWest. We wanted these nodes to learn a
navigation policy that was independent of how much fuel
remained. But this means that when the fuel is exhausted and
a -20 penalty is received, these Q nodes cannot represent
the reason for this penalty! This is the hierarchical credit
assignment problem: to determine which node is responsible
for a reward that is received. Our solution is for the
designer of the MAXQ hierarchy to also decompose the
reward function. When each reward is generated, a marker
is attached that indicates which Q nodes are potentially
responsible for this reward. For the -20 empty fuel penalty,
the QGet, QPut, and QRefuel nodes are held responsible,
because their parent, MaxRoot, must compare their Q values
to decide when to refuel to avoid the penalty. Their $C$
functions must therefore be able to represent the rewards.

This requires a change to the decomposition equations. Let
$R_i(s' \mid s, a)$ be the portion of the reward that is assigned to
node $i$. Then we write the following:

$$C_i^\pi(s, a) = \sum_{s'} P_i^\pi(s' \mid s, a) \left[ R_i(s' \mid s, a) + V_i^\pi(s') \right] \tag{6}$$

$$V_i^\pi(s) = \begin{cases} Q_i^\pi(s, \pi_i(s)) & \text{if } i \text{ is composite} \\ \sum_{s'} P(s' \mid s, i)\, R_i(s' \mid s, i) & \text{if } i \text{ is primitive} \end{cases} \tag{7}$$

In many domains, we believe it will be easy for the
designer of the hierarchy to also decompose the reward
function. However, an interesting problem for future research
is to develop algorithms for autonomously solving the
hierarchical credit assignment problem.

4 A Learning Algorithm

The preceding section has shown that the hierarchy can
correctly represent the value function of any hierarchical
policy if the full state is employed to represent the $C_i$
function in each node $i$. Hence, we could apply Parr and
Russell’s HAM-Q algorithm to learn the best hierarchical
policy. However, because we are committed to employing
state abstractions, we have chosen instead to develop a
reinforcement learning algorithm for finding a recursively
optimal policy.

It turns out that in general there can be many different
recursively optimal policies, and that some of them achieve


The  MAXQ  Method  for  Hierarchical  Reinforcement  Learning  123 


better expected rewards than others. The problem is that
a subtask may have many policies that are locally optimal,
but some of them are more useful than others for the
overall task. For example, suppose we changed the taxi domain
so that if the taxi hits a wall, the trial is terminated with a
reward of -5. Then for MaxNavigate(t), if the target
location t is more than 5 steps away, the locally optimal policy
would be to hit a wall. This would not be part of any
hierarchically optimal policy, however! Dayan and Hinton
faced this same problem, and they solved it by providing a
penalty of 10 points to subtask $i$ for entering an undesired
terminal state (i.e., a state in $T_i$ but not in $G_i$). This has
the proper effect, but in the MAXQ hierarchy, it causes the
value function computed by the entire hierarchy to be
incorrect, because it incorporates the (often non-zero)
probability of receiving these terminal state penalties.

A better method is to define, for each Max node MDP $M_i$, a
parallel Markov decision problem $\tilde{M}_i$ with the same states,
actions, and transition probabilities as $M_i$ but with a second
reward function $\tilde{R}_i$ that is zero except for undesired
terminal states, where it provides a large penalty. (We used a
penalty of -100 points.) Our learning algorithm will seek
a locally optimal policy $\tilde{\pi}_i^*$ for $\tilde{M}_i$. However, it will also
compute the value function for executing $\tilde{\pi}_i^*$ in the original
MDP $M_i$, and this is the value that will be passed “up” the
MAXQ hierarchy.

Specifically, our learning algorithm MAXQ-Q is a variant
of Q learning that performs the following. At each composite
Max node, we maintain two tables $\tilde{C}_i(s, a)$ and $C_i(s, a)$.
The algorithm chooses an action $a$ to perform according to
its current exploration policy. It executes $a$, observes the
resulting state $s'$ and reward $R_i(s' \mid s, a)$, and computes the
following:

$$a^* := \operatorname*{argmax}_{a'} \left[ \tilde{C}_i(s', a') + V_{a'}(s') \right] \tag{8}$$

$$\tilde{C}_i(s, a) := (1 - \alpha_t(i))\, \tilde{C}_i(s, a) + \alpha_t(i) \cdot \left[ \tilde{R}_i(s') + R_i(s' \mid s, a) + \tilde{C}_i(s', a^*) + V_{a^*}(s') \right] \tag{9}$$

$$C_i(s, a) := (1 - \alpha_t(i))\, C_i(s, a) + \alpha_t(i) \cdot \left[ R_i(s' \mid s, a) + C_i(s', a^*) + V_{a^*}(s') \right] \tag{10}$$

Here $a^*$ is the best action in $s'$ according to the current $\tilde{C}$
and $V$ values. Both $\tilde{C}$ and $C$ are updated using $a^*$. At each
leaf node $i$, the update is slightly different:

$$V_i(s) := (1 - \alpha_t(i))\, V_i(s) + \alpha_t(i)\, R_i(s' \mid s, i). \tag{11}$$

The quantity $\alpha_t(i)$ is the learning rate for node $i$ at time step
$t$.
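
The updates of Eqs. (8)-(11) can be sketched in a few lines of tabular code. This is an illustrative reading, not the authors' implementation: the table layout, the function names, the explicit `terminal` flag (which drops the continuation terms when the subtask has ended), and the rule used to evaluate a composite child (greedy with respect to the penalized table, valued with the uncontaminated one) are all our assumptions.

```python
from collections import defaultdict

C = defaultdict(float)        # C[(i, s, a)]: completion values, original rewards
C_tilde = defaultdict(float)  # C_tilde[(i, s, a)]: completion values with penalties
V_leaf = defaultdict(float)   # V_leaf[(i, s)]: expected reward of primitive i


def V(node, s, children):
    """Current value estimate of Max node `node` in state s."""
    if node not in children:
        return V_leaf[(node, s)]
    best = max(children[node],
               key=lambda a: C_tilde[(node, s, a)] + V(a, s, children))
    return C[(node, s, best)] + V(best, s, children)


def update_composite(i, s, a, s_next, r_i, r_tilde, alpha, children,
                     terminal=False):
    """One update of both tables at composite node i after child a ran s -> s_next."""
    if terminal:                                  # assumed: nothing left to complete
        cont_tilde = cont = 0.0
    else:
        a_star = max(children[i],                 # Eq. (8)
                     key=lambda b: C_tilde[(i, s_next, b)] + V(b, s_next, children))
        v_star = V(a_star, s_next, children)
        cont_tilde = C_tilde[(i, s_next, a_star)] + v_star
        cont = C[(i, s_next, a_star)] + v_star
    C_tilde[(i, s, a)] = ((1 - alpha) * C_tilde[(i, s, a)]
                          + alpha * (r_tilde + r_i + cont_tilde))   # Eq. (9)
    C[(i, s, a)] = (1 - alpha) * C[(i, s, a)] + alpha * (r_i + cont)  # Eq. (10)


def update_leaf(i, s, reward, alpha):
    """Eq. (11): running average of the reward of primitive action i in state s."""
    V_leaf[(i, s)] = (1 - alpha) * V_leaf[(i, s)] + alpha * reward


# Tiny example: one composite node with two primitive children.
children = {"Root": ["a", "b"]}
V_leaf[("a", "s1")] = -1.0
V_leaf[("b", "s1")] = -3.0
update_composite("Root", "s0", "a", "s1", r_i=0.0, r_tilde=0.0,
                 alpha=0.5, children=children)
update_composite("Root", "s0", "b", "s2", r_i=0.0, r_tilde=-100.0,
                 alpha=0.5, children=children, terminal=True)
print(C[("Root", "s0", "a")], C_tilde[("Root", "s0", "b")], C[("Root", "s0", "b")])
# -0.5 -50.0 0.0
```

In the example, the penalty for the undesired termination steers action selection through the penalized table, while the table that is passed up the hierarchy stays free of it.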


In order to prove convergence of this algorithm, we must
make several assumptions. First, we must assume that all
deterministic policies in MDP $M$ are proper (i.e., they all
terminate with probability 1). Second, we must assume
that all locally optimal policies, $\tilde{\pi}^*$, give the same
transition probability distribution $P_i^{\tilde{\pi}^*}(s' \mid s, a)$. This ensures that
all locally optimal policies at node $a$ give rise to the same
MDP at any node $i$ that is a parent of $a$. (A consequence
of this assumption is that all recursively optimal policies
will have the same value function.) Third, we must
assume that $|V_i|$, $|C_i|$, and $|\tilde{C}_i|$ are bounded at all times (this
is easy to enforce). Fourth, the exploration policy executed
at each node $i$ during learning must be a GLIE (greedy in
the limit with infinite exploration) policy, that is, a policy
that executes each action infinitely often in every state that
is visited infinitely often, and that in the limit is greedy with
respect to $Q_i$ with probability 1. Finally, the learning rates
$\alpha_t(i)$ must satisfy the usual conditions:

$$\lim_{T \to \infty} \sum_{t=1}^{T} \alpha_t(i) = \infty \quad \text{and} \quad \lim_{T \to \infty} \sum_{t=1}^{T} \alpha_t^2(i) < \infty \tag{12}$$
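
As a concrete instance of condition (12), the familiar schedule $\alpha_t(i) = 1/t$ qualifies. The schedule is our choice for illustration; the text does not say which one was used in the experiments.

```latex
\sum_{t=1}^{T} \frac{1}{t} \;\ge\; \ln(T+1) \;\longrightarrow\; \infty
\quad (T \to \infty),
\qquad
\sum_{t=1}^{\infty} \frac{1}{t^{2}} \;=\; \frac{\pi^{2}}{6} \;<\; \infty .
```

A constant rate $\alpha_t(i) = c > 0$, by contrast, satisfies the first condition but not the second, since $\sum_t c^2$ diverges.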

Theorem 3 Under the assumptions listed above, with
probability 1, MAXQ-Q will converge to a recursively
optimal policy for MDP $M$ consistent with MAXQ hierarchy
$H$.

Proof Sketch: The proof employs a stochastic approximation
argument similar to those introduced to prove the
convergence of Q learning and SARSA(0) (Jaakkola, Jordan, &
Singh, 1994; Bertsekas & Tsitsiklis, 1996; Singh, Jaakkola,
Littman, & Szepesvári, 1998). The proof is by induction on
the levels of the tree, starting at the Max nodes all of whose
children are primitive leaf nodes. At these “first-level” Max
nodes, the standard results for Q learning can be applied to
prove that the Q values will converge with probability 1
to the optimal value function. Furthermore, because each
node $i$ is executing a GLIE exploration policy, the policy
at these nodes will also converge with probability 1 to a
locally optimal policy.

Now consider a Max node $j$ all of whose children are either
primitive nodes or “first-level” Max nodes. Define
$P_j^t(s' \mid s, i)$ to be the transition probabilities observed by
parent node $j$ when it invokes child node $i$ in state $s$ at time $t$
in the learning process. Because the first-level Max nodes
are executing GLIE policies, $P_j^t(s' \mid s, i)$ will converge (with
probability 1) to the state transitions $P_j(s' \mid s, i)$ that will be
produced by any of the locally optimal policies for node
$i$ (by assumption, all of these locally optimal policies
give the same state transition probabilities). This enables
us to prove that node $j$ also converges with probability 1
to the optimal $C_j$ values and a locally optimal policy. The
key is to decompose the error in any particular $C_j$ backup
into two terms. One term, corresponding to the difference
between a sample backup (using the observed state
transition) and a full Bellman backup (using $P_j^t(s' \mid s, i)$), has
expected value of zero. The other term, corresponding to
the difference between doing a full Bellman backup using
the current transition probabilities $P_j^t(s' \mid s, i)$ and doing a
full Bellman backup using the final transition probabilities
$P_j(s' \mid s, i)$, converges to zero with probability 1. By applying
a stochastic approximation result (Proposition 4.5 from
Bertsekas and Tsitsiklis, 1996), we can prove that node $j$
will converge to a locally optimal policy. Hence, by
induction, we can prove that the entire hierarchy converges to a
recursively optimal policy. End of Proof Sketch.
There is one interesting method that can be employed to
accelerate learning in the higher nodes of the graph. When
an action $a$ is chosen for Max node $i$ in state $s$, the
execution of $a$ will move the environment through a series of
states $s_1, \ldots, s_k, s_{k+1} = s'$. If $a$ was indeed the best action
to choose in $s_1$, then it should also be the best action to
choose (at node $i$) in states $s_2$ through $s_k$. Hence, equations
(9) and (10) can be applied in all of these states. This
reflects an important difference between standard subroutine
calls and the MAXQ hierarchy. In standard subroutines,
there is a set of preconditions that must be true at the start
of the subroutine. A partially-executed subroutine can
often make these preconditions false, so that it is not
possible to interrupt a subroutine and then call it again without
first re-establishing the preconditions. In the MAXQ
hierarchy, however, a Max node $i$ can be invoked in any state
$s \in S_i$, and it must “complete” execution of the task from
that state onward. This means that the execution of the Max
node can be interrupted and restarted with no change to the
hierarchy.

We  applied  algorithm  MAXQ-Q  to  the  Taxi  task  using  a 
tabular  representation  of  the  C  functions.  We  employed 
state  abstraction  as  follows.  For  the  QNorth,  QSouth, 
QEast,  and  QWest  nodes,  the  C  function  ignores  the  pas¬ 
senger  source  and  destination  locations  and  the  amount  of 
fuel.  The  C  function  of  QPickup  ignores  the  passenger  des¬ 
tination  and  fuel,  but  it  must  know  the  source  location  and 
taxi  location  in  order  to  predict  the  effects  of  illegal  Pickup 
actions.  Similarly,  QPutdown  ignores  the  passenger  source 
location  and  the  fuel,  and  QFillup  ignores  the  source  and 
destination  locations  and  the  fuel.  QNavigateForGet  can 
represent  its  C  function  by  a  single  value,  because  after 
a  successful  Navigate,  only  a  Pickup  remains  to  complete 
the  Get  action.  The  same  is  true  for  QNavigateForPut  and 
QNavigateForRefuel. Because of the hierarchical credit assignment,
QGet and QRefuel need to see the entire state,
but QPut can ignore all of the state information, because
once it succeeds, the task is completed. All of these abstractions
mean that instead of a set of seven 8,750-element
Q functions (61,250 values) for flat Q learning, the MAXQ
hierarchy requires only 18,253 values to represent the C
functions.

Figure 4: Online performance of flat and hierarchical Q learning
on the Taxi task. Each curve is smoothed using a 200-trial moving
average. The horizontal line shows the average performance of
the optimal policy.

Figure  4  compares  the  online  performance  of  flat  and  hi¬ 
erarchical  Q  learning.  For  flat  Q  learning,  we  employed 
Boltzmann  exploration  with  an  initial  temperature  of  50. 
This  was  decreased  by  a  factor  of  0.997  after  each  suc¬ 
cessful  trial.  We  experimented  with  many  different  cool¬ 
ing  schedules,  but  we  were  unable  to  get  flat  Q  learning  to 
converge  to  the  optimal  policy  within  50,000  trials.  This 
was  the  fastest  cooling  schedule  that  was  able  to  attain  (at 
least  briefly)  the  optimal  expected  reward.  For  hierarchical 
Q  learning,  we  employed  a  separate  temperature  for  each 
Max  node.  The  starting  temperature  for  all  nodes  was  50 
except  MaxRoot,  which  used  100.  Each  node  decreased  its 
temperature  when  it  successfully  reached  a  goal  terminal 
state.  MaxRoot  was  cooled  by  a  factor  of  0.9986,  the  sec¬ 
ond  level  Max  nodes  at  0.997,  and  MaxNavigate  at  0.995. 
In all cases, a learning rate of $\alpha = 1$ was employed, since
all actions and rewards are deterministic.
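
The exploration scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`boltzmann_probs`, `choose_action`, `cool`) and the dictionary layout are hypothetical, and only the starting temperatures and cooling factors are taken from the text.

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    """Return action probabilities proportional to exp(Q / T)."""
    q_max = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def choose_action(q_values, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

def cool(temperature, factor):
    """Apply one cooling step (called when a node reaches a goal terminal state)."""
    return temperature * factor

# Per-node settings quoted in the text; the dictionary layout is an assumption.
temperatures = {"MaxRoot": 100.0, "MaxPut": 50.0, "MaxNavigate": 50.0}
cooling = {"MaxRoot": 0.9986, "MaxPut": 0.997, "MaxNavigate": 0.995}
```

Because the lower nodes use smaller cooling factors, their temperatures fall faster, so they become greedy (and competent) before the nodes above them stop exploring.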

These  cooling  rates  were  chosen  so  that  the  lower  Max 
nodes  in  the  graph  can  become  reasonably  competent  at 
their  subtasks  before  the  nodes  higher  in  the  graph  try  to 
learn.  If  care  is  not  taken,  a  Max  node  i  may  conclude  that 
a  subtask  a  is  very  expensive  (because  the  subtask  has  not 
yet  learned  a  good  policy),  and  therefore,  it  sets  the  C  value 
for  a  very  low.  When  this  is  combined  with  Boltzmann  ex¬ 
ploration,  the  result  is  that  the  subtask  may  never  be  tried 
again. Hence, we only performed an update for a Q node
if that node completed its subtask with an average absolute
Bellman error per step of less than 0.2. (This parameter
was not tuned at all.)

Figure  4  shows  that  the  hierarchical  method  is  able  to  learn 
the  task  much  faster  and  achieve  a  higher  level  of  perfor¬ 
mance  than  flat  Q  learning.  Of  course,  both  methods  could 
be  improved  by  employing  techniques  for  accelerating  Q 
learning,  such  as  eligibility  traces  (e.g.,  Peng  &  Williams, 
1996). 

5  Non-Hierarchical  Execution 

We have shown that the MAXQ hierarchy can learn an optimal
policy for an MDP if that policy is a recursively optimal
hierarchical policy. However, there are situations in which the
optimal policy is almost, but not quite, hierarchical. For
example,  consider  a  modified  Taxi  task  (the  “fickle  Taxi 
problem”)  in  which  as  soon  as  the  taxi  picks  up  the  pas¬ 
senger  and  moves  one  square,  the  passenger  can  randomly 
change  the  destination  with  probability  0.3.  This  change 
comes  after  the  hierarchical  policy  has  committed  to  exe¬ 
cuting  QNavigateForPut(r)  for  the  original  destination.  As 
a  result,  the  MaxNavigate  subtask  will  take  the  taxi  to  the 
old  destination.  Then  control  will  return  to  MaxPut,  which 
will  invoke  QNavigateForPut  to  move  the  taxi  to  the  new 
destination. 

Such  “almost  hierarchical”  MDP’s  raise  the  question  of 
whether  there  is  a  way  to  convert  a  recursively-optimal  hi¬ 
erarchical  policy  into  an  optimal  non-hierarchical  policy. 

To  answer  this  question,  we  implemented  the  Fickle  Taxi 
domain.  We  removed  all  aspects  of  fuel  from  the  domain 
so  that  we  could  figure  out  the  optimal  policy  and  hand- 
code  it.  Figure  5  compares  the  performance  of  flat  Q  learn¬ 
ing  and  hierarchical  Q  learning  on  this  modified  task.  The 
optimal  policy  can  achieve  an  average  reward  per  step  of 
1.172;  but  the  best  hierarchical  policy  (compatible  with  the 
MAXQ  graph  of  Figure  2)  can  only  achieve  1.002.  Hier¬ 
archical  learning  with  MAXQ-Q  is  able  to  attain  this  level 
rapidly.  Flat  Q  learning  approaches  the  optimum,  but  does 
not  reach  it  within  10,000  trials.  We  tuned  each  algorithm 
to  optimize  its  performance.  We  employed  a  learning  rate 
of  0.35  and  decayed  the  initial  temperature  of  50.0  by  a  fac¬ 
tor  of  .460  (for  flat  Q)  and  .211  (for  hierarchical  Q)  when¬ 
ever  a  goal  terminal  state  was  reached. 

An  alternative  to  hierarchical  execution  of  the  MAXQ 
graph  is  polling  execution,  as  first  suggested  by  Kaelbling 
in  her  (1993)  Hierarchical  Distance  to  Goal  method.  In  the 
polling  approach  to  MAXQ,  each  action  is  chosen  by  start¬ 
ing  at  MaxRoot  and  computing  the  path  (from  root  to  leaf) 
with  the  highest  Q  value.  The  primitive  action  at  the  end  of 
this path is then executed, and the process is repeated. This
is equivalent to computing the one-step greedy lookahead
policy  given  the  current  value  function.  If  the  hierarchi¬ 
cal  policy  is  not  optimal,  then  this  one-step  greedy  policy 
will  be  closer  to  an  optimal  policy,  because  it  corresponds 
to  one  step  of  policy  improvement  in  the  policy  iteration 
algorithm  (Bertsekas,  1995).  This  informally  proves  the 
following: 

Theorem 4 For all states $s$, the value of the policy computed
by polling execution of the MAXQ hierarchy is ≥ the
value of the policy computed by hierarchical execution.

Hence,  polling  execution  of  a  MAXQ  graph  can  produce  a 
non-hierarchical  policy  that  is  better  than  the  hierarchical 
policy  represented  by  the  graph. 
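
Polling execution can be sketched as a recursive search for the best root-to-leaf path. The sketch below assumes the MAXQ decomposition in which the value of choosing child $a$ at node $i$ is the child's own value plus the completion value $C(i, s, a)$; the toy graph, the table layout, and the name `best_path` are illustrative assumptions and do not reproduce the Taxi hierarchy.

```python
# Hypothetical toy hierarchy: each composite node maps to its child actions.
# Any name that is not a key is treated as a primitive action.
children = {
    "Root": ["Get", "Put"],
    "Get": ["north", "pickup"],
    "Put": ["south", "putdown"],
}

def best_path(node, state, v_prim, c):
    """Return (value, primitive action) for the best path below `node`.

    v_prim[(a, s)] is the value of primitive action a in state s.
    c[(i, s, a)] is the completion value of child a under node i.
    """
    if node not in children:  # primitive action: its value is stored directly
        return v_prim[(node, state)], node
    best = None
    for child in children[node]:
        child_value, leaf = best_path(child, state, v_prim, c)
        q = child_value + c[(node, state, child)]
        if best is None or q > best[0]:
            best = (q, leaf)
    return best

# Made-up values for a single state "s", used only to exercise the function.
v_prim = {("north", "s"): -1, ("pickup", "s"): -1,
          ("south", "s"): -1, ("putdown", "s"): 5}
c = {("Root", "s", "Get"): 0, ("Root", "s", "Put"): 0,
     ("Get", "s", "north"): 0, ("Get", "s", "pickup"): 0,
     ("Put", "s", "south"): 2, ("Put", "s", "putdown"): 0}
value, action = best_path("Root", "s", v_prim, c)
```

Under polling, only the returned primitive action is executed; the search is then repeated from the root in the next state, so no subtask is ever committed to for more than one step.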

We  tested  this  on  the  Fickle  Taxi  task  by  first  training  the 
MAXQ  hierarchy  by  MAXQ-Q  for  1000  trials  and  then 
continuing  the  training  with  polling  execution.  Figure  6 
shows  that  there  is  an  initial  loss  of  performance  when  we 
switch  to  polling  execution.  This  is  because  during  hierar¬ 
chical  training,  the  more  abstract  Q  nodes  in  the  graph  have 
only  learned  their  C  values  well  in  states  where  they  were 
frequently  executed.  Under  polling,  they  are  now  executed 
in  other  states  as  well,  and  they  rapidly  learn  the  correct 
values  so  that  performance  is  able  to  reach  the  level  of  the 
optimal  non-hierarchical  policy.  In  this  domain,  polling  ex¬ 
ecution  of  the  best  hierarchical  policy  can  produce  the  op¬ 
timal  policy. 

6  Concluding  Remarks 

This  paper  has  defined  the  MAXQ  value  function  decom¬ 
position  for  hierarchical  reinforcement  learning.  The  pa¬ 
per  has  shown  that  the  MAXQ  graph  can  represent  the 
value  function  of  any  hierarchical  policy  implemented  by 
the  graph.  A  learning  algorithm  based  on  Q  learning  was 
introduced,  proved  to  converge,  and  shown  experimentally 
to  perform  much  better  than  ordinary,  non-hierarchical  Q 
learning. 

The  most  important  aspect  of  the  MAXQ  method  is  the  sep¬ 
aration  between  the  context-independent  policy  and  value 
function (represented by the Max nodes) and the context-dependent
value function (represented by the Q nodes).
This  permits  the  value  functions  of  subtasks  to  be  learned 
independent  of  their  context,  and  this  enhances  the  re¬ 
usability  of  the  subtasks  and  makes  it  easier  to  employ  state 
abstraction  within  the  subtasks.  However,  optimality  of  the 
learned  policy  is  lost  in  general,  and  hierarchical  credit- 
assignment  problems  may  be  introduced.  Fortunately,  the 
ability  of  the  MAXQ  hierarchy  to  represent  the  value  func¬ 
tion  of  the  hierarchical  policy  permits  the  non-hierarchical 
execution of a one-step greedy policy that is better than the
hierarchical policy.

Figure 5: Online performance of flat and hierarchical Q learning
on the Fickle Taxi task. Each curve is the average of 10 runs;
the returns from each run were smoothed by a 200-trial moving
average.
Abstract

Current  methods  to  avoid  overfitting  are  ei¬ 
ther  data-oriented  (using  separate  data  for 
validation)  or  representation-oriented  (penal¬ 
izing  complexity  in  the  model).  This  paper 
proposes  process-oriented  evaluation,  where 
a  model’s  expected  generalization  error  is 
computed  as  a  function  of  the  search  pro¬ 
cess  that  led  to  it.  The  paper  develops 
the  necessary  theoretical  framework,  and  ap¬ 
plies  it  to  one  type  of  learning:  rule  induc¬ 
tion.  A  process-oriented  version  of  the  CN2 
rule  learner  is  empirically  compared  with 
the  default  CN2.  The  process-oriented  ver¬ 
sion  is  more  accurate  in  a  large  majority 
of  the  datasets,  with  high  significance,  and 
also produces simpler models. Experiments
in artificial domains suggest that process-oriented
evaluation is particularly useful in
high-dimensional domains.

1  INTRODUCTION 

Overfitting  avoidance  is  often  considered  the  central 
problem  of  machine  learning  (e.g.,  (Cheeseman  & 
Oldford,  1994)).  If  a  learner  is  sufficiently  powerful, 
it  must  guard  against  selecting  a  model  that  fits  the 
training  data  well  but  captures  the  underlying  phe¬ 
nomenon  poorly.  Current  methods  to  address  this 
problem  fall  into  two  broad  categories.  Data-oriented 
evaluation  uses  separate  data  to  learn  and  validate 
models,  and  includes  methods  like  cross-validation 
(Breiman,  Friedman,  Olshen  &  Stone,  1984;  Stone, 
1974),  the  bootstrap  (Efron  &  Tibshirani,  1993),  and 
reduced-error  pruning  (Brunk  &  Pazzani,  1991).  It 
has several disadvantages: it is often computationally
intensive, reduces the data available for learning, can
be  unreliable  if  the  validation  set  is  small,  and  is  it¬ 
self  prone  to  overfitting  if  a  large  number  of  models  is 
compared  (Ng,  1997).  Representation-oriented  evalu¬ 
ation  seeks  to  avoid  these  problems  by  using  the  same 
data  for  training  and  validation,  but  a  priori  penaliz¬ 
ing  some  models  as  more  likely  to  overfit.  Bayesian  ap¬ 
proaches  in  general  fall  into  this  category  (Cheeseman, 
1990; MacKay, 1992). Representation-oriented measures
typically contain two terms, one reflecting fit
to the data, and one penalizing model complexity
(Akaike,  1978;  Schwarz,  1978;  Wallace  &  Boulton, 
1968;  Rissanen,  1978;  Moody,  1992).  This  approach  is 
only  appropriate  when  the  simpler  models  are  truly  the 
more  accurate  ones,  and  there  is  mounting  evidence 
that this is typically not the case (Domingos, 1998;
Domingos, 1997; Schuurmans, Ungar & Foster, 1997;
Lawrence, Giles & Tsoi, 1997; Webb, 1996; Schaffer,
1993; Murphy & Pazzani, 1994, etc.). Structural
risk minimization (Vapnik, 1995) and PAC learning
(Kearns & Vazirani, 1994) are representation-oriented
methods  that  seek  to  bound  the  difference  between 
training  and  generalization  error  using  a  function  of 
the  model  space’s  (effective)  dimension.  This  typically 
produces  bounds  that  are  overly  broad,  and  requires 
severely  restricting  the  model  space. 

In  this  paper  we  argue  that  representation-oriented 
evaluation  has  these  limitations  because  it  only  con¬ 
siders  the  learner’s  model  space,  and  not  its  search 
process. A learner with an unlimited model space can
avoid overfitting as long as it attempts only a limited
number of hypotheses (even if it is not possible a priori
to predict which). If these hypotheses are correlated,
the  chance  of  overfitting  is  further  reduced.  Given 
the  sequence  of  hypotheses  that  a  learner  attempts, 
it  is  possible  to  estimate  the  generalization  error  of 
the  “current  best”  hypothesis  taking  into  account  the 
process that led to it. Intuitively, the more hypotheses
that have been attempted and the less correlated they
are, the higher the generalization error we expect for a
given training-set error. This paper begins to develop
this approach, which we will call process-oriented evaluation
(POE for short). The basic theoretical framework
is presented, and then applied to the standard
“separate and conquer” rule induction process (Clark
& Niblett, 1989). An empirical study demonstrates
the effectiveness of POE. The paper concludes with
sections on related and future work.

Figure 1: A simple example of an overfitting avoidance
problem.

2  PROCESS-ORIENTED 
EVALUATION 

Consider the simplest example of an overfitting avoidance
problem, in a classification context. Suppose
learner $L_1$ consists of drawing one hypothesis at random
from some model space and returning it, and
learner $L_2$ consists of drawing two hypotheses at random
(independently) from the same model space as $L_1$,
and returning the one with lowest error on a training
sample $S$. This situation is shown schematically in
Figure 1. Let $h_1$ be the hypothesis returned by $L_1$,
$h_2$ the hypothesis returned by $L_2$, $n$ the number of
examples in $S$, and $e_i$ the number of examples $h_i$
misclassifies. The goal is to choose the hypothesis with
lowest true error $\epsilon_i$ (i.e., $\epsilon_i$ is the probability of $h_i$
misclassifying an example, given the true example
distribution). Suppose $n = 100$, $e_1 = 12$, and $e_2 = 11$.
Should we prefer $h_1$ or $h_2$? According to the maximum
likelihood principle (DeGroot, 1986), $\epsilon_1 = 0.12$ and
$\epsilon_2 = 0.11$, so $h_2$ should be chosen. Assuming the two
hypotheses have the same complexity or prior probability,
representation-oriented evaluation would give
the same answer. However, $L_2$ had two opportunities
to draw a hypothesis with low training error, and so
the probability of $e_2$ being low merely by chance is
higher than for $e_1$. Thus $h_2$ may in fact have a higher
true error rate than $h_1$.


This notion can be quantified. If a hypothesis $h$'s true
error rate is $\epsilon$ and $S$ consists of $n$ independently drawn
examples, the number of errors $e$ committed by $h$ on
$S$ is a binomially distributed variable with parameters
$n$ and $\epsilon$:

$$p(e \mid n, \epsilon) = b(e \mid n, \epsilon) = \binom{n}{e} \epsilon^{e} (1 - \epsilon)^{n-e} \qquad (1)$$

Let $B(e \mid n, \epsilon)$ be the probability that the number of
errors is greater than $e$:

$$B(e \mid n, \epsilon) = \sum_{i=e+1}^{n} b(i \mid n, \epsilon) \qquad (2)$$

Notice that this notation is the opposite of the usual
notation for a cumulative distribution function (i.e.,
$B(e \mid n, \epsilon) = 1 - \mathrm{Binomial\_cdf}(e \mid n, \epsilon)$). It will be more
convenient for what follows.

The probability of $h_1$ misclassifying $e_1$ examples is
$p(e_1 \mid n, \epsilon_1) = b(e_1 \mid n, \epsilon_1)$. This can be used with Bayes's
theorem to compute the expected value of $\epsilon_1$ given $n$
and $e_1$, $E[\epsilon_1 \mid n, e_1]$. By finding a similar expression for
$p(e_2 \mid n, \epsilon_2)$ we can compute $E[\epsilon_2 \mid n, e_2]$ and choose the
hypothesis with lowest expected error. Let the two
hypotheses drawn by $L_2$ be $h_{2,1}$ and $h_{2,2}$ (with true
errors $\epsilon_{2,1}$ and $\epsilon_{2,2}$ respectively, and numbers of training
errors $e_{2,1}$ and $e_{2,2}$). From these, $L_2$ chooses the
one with lowest training error (i.e., $h_2 = h_{2,j}$, where
$j = \arg\min_{i \in \{1,2\}} e_{2,i}$). Then the probability of $L_2$
returning a hypothesis $h_2$ that misclassifies $e_2$ training
examples is the probability that $h_{2,1}$ misclassifies $e_2$
training examples and $h_{2,2}$ misclassifies more, or vice-versa,
or both $h_{2,1}$ and $h_{2,2}$ misclassify $e_2$ examples:

$$\begin{aligned}
p(e_2 \mid n, \epsilon_2) ={}& b(e_2 \mid n, \epsilon_{2,1})\, B(e_2 \mid n, \epsilon_{2,2}) \\
&+ B(e_2 \mid n, \epsilon_{2,1})\, b(e_2 \mid n, \epsilon_{2,2}) \\
&+ b(e_2 \mid n, \epsilon_{2,1})\, b(e_2 \mid n, \epsilon_{2,2})
\end{aligned} \qquad (3)$$

Our goal is to use this equation to compute the expected
value of $\epsilon_2$. We are hindered by the fact that
in addition to $\epsilon_2$ (whether it is $\epsilon_{2,1}$ or $\epsilon_{2,2}$) the equation
contains another unknown parameter (whichever
$\epsilon_{2,i}$ is not $\epsilon_2$). Since we are not interested in $\epsilon_{2,1}$ or
$\epsilon_{2,2}$ per se, but only in the effect on $\epsilon_2$ of trying two
hypotheses instead of one, we propose the following
heuristic: assume that $\epsilon_{2,1} = \epsilon_{2,2} = \epsilon_2$. This approximation
will be good if $\epsilon_{2,1}$ and $\epsilon_{2,2}$ are similar, and
poor if they are very different. However, this heuristic
may yield good results even in the latter case, because
a close approximation of $E[\epsilon_2 \mid n, e_2]$ is not required;
all that is required is that $E[\epsilon_2 \mid n, e_2] > E[\epsilon_1 \mid n, e_1]$ iff
$\epsilon_2 > \epsilon_1$, which is a much weaker condition (Domingos
& Pazzani, 1997). If $\epsilon_{2,1} = \epsilon_{2,2} = \epsilon_2$, Equation 3 becomes:


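$$\begin{aligned}
p(e_2 \mid n, \epsilon_2) &= b(e_2 \mid n, \epsilon_2)\, B(e_2 \mid n, \epsilon_2) + B(e_2 \mid n, \epsilon_2)\, b(e_2 \mid n, \epsilon_2) + b(e_2 \mid n, \epsilon_2)\, b(e_2 \mid n, \epsilon_2) \\
&= [B(e_2 \mid n, \epsilon_2) + b(e_2 \mid n, \epsilon_2)]^2 - B^2(e_2 \mid n, \epsilon_2) \\
&= B^2(e_2 - 1 \mid n, \epsilon_2) - B^2(e_2 \mid n, \epsilon_2)
\end{aligned} \qquad (4)$$

The last step uses $B(e_2 - 1 \mid n, \epsilon_2) = B(e_2 \mid n, \epsilon_2) + b(e_2 \mid n, \epsilon_2)$,
which follows directly from Equation 2. Applying Bayes's theorem:

$$p(\epsilon_2 \mid n, e_2) \propto p(\epsilon_2)\, p(e_2 \mid n, \epsilon_2) \qquad (5)$$

$p(\epsilon_2)$ can be used to incorporate prior beliefs about the
error rate of the hypotheses considered by $L_2$. Here it
will simply be assumed uniform:^1

$$p(\epsilon_2 \mid n, e_2) \propto p(e_2 \mid n, \epsilon_2) \qquad (6)$$

The expected value of $\epsilon_2$ can now be computed by
integration:

$$E[\epsilon_2 \mid n, e_2] = \frac{\int_0^1 \epsilon_2\, p(e_2 \mid n, \epsilon_2)\, d\epsilon_2}{\int_0^1 p(e_2 \mid n, \epsilon_2)\, d\epsilon_2} \qquad (7)$$

Doing this for $e_2 = 11$, $n = 100$ results in $E[\epsilon_2 \mid n, e_2] =
0.134$. A similar treatment for $\epsilon_1$, using $e_1 = 12$, $n =
100$ and $p(e_1 \mid n, \epsilon_1) = b(e_1 \mid n, \epsilon_1)$, yields $E[\epsilon_1 \mid n, e_1] =
0.127$. Thus the hypothesis output by $L_1$ would be
preferred, even though $L_2$'s has a lower training error.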

Equation 4 can be readily generalized to a learner $L_m$
that draws $m$ hypotheses at random and chooses the
one with lowest training error:

$$p(\epsilon_m \mid n, e_m) \propto p(e_m \mid n, \epsilon_m) = B^m(e_m - 1 \mid n, \epsilon_m) - B^m(e_m \mid n, \epsilon_m) \qquad (8)$$

^1 This is an unrealistic assumption, and is made solely
for the sake of simplicity. As the following sections show,
the proposed method can be effective even when this assumption
is used. This can be attributed to the fact that,
except for very small sample sizes and/or very extreme priors,
the effect of the likelihood term $p(e_2 \mid n, \epsilon_2)$ will easily
dominate the prior's. In any case, a version of POE using
beta priors is currently being implemented.
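
The worked example above can be checked numerically. The following is a minimal sketch, assuming a uniform prior and Simpson's rule on a fixed grid; the function names and the `steps` parameter are illustrative choices, and the direct summation used for $B$ is only practical for small $n$.

```python
from math import comb

def b(e, n, eps):
    """Binomial probability of exactly e errors in n examples (Equation 1)."""
    return comb(n, e) * eps**e * (1.0 - eps)**(n - e)

def B(e, n, eps):
    """Probability of more than e errors in n examples (Equation 2)."""
    return sum(b(i, n, eps) for i in range(e + 1, n + 1))

def likelihood(e, n, eps, m):
    """p(e | n, eps) for the best of m independent hypotheses (Equation 8)."""
    return B(e - 1, n, eps)**m - B(e, n, eps)**m

def expected_error(e, n, m, steps=1000):
    """E[eps | n, e] under a uniform prior (Equation 7), via Simpson's rule.

    `steps` must be even.
    """
    num = den = 0.0
    for k in range(steps + 1):
        eps = k / steps
        # Simpson weights 1, 4, 2, 4, ..., 4, 1 (the common factor h/3 cancels).
        w = 1 if k in (0, steps) else (4 if k % 2 else 2)
        p = likelihood(e, n, eps, m)
        num += w * eps * p
        den += w * p
    return num / den

e1 = expected_error(12, 100, m=1)  # learner L_1: one hypothesis, 12 errors
e2 = expected_error(11, 100, m=2)  # learner L_2: best of two, 11 errors
```

With `m=1` the posterior is a beta distribution whose mean is $(e+1)/(n+2) = 13/102$, so `e1` should agree with the 0.127 quoted in the text; `e2` should come out close to the quoted 0.134, confirming that the hypothesis with more training errors has the lower expected true error.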

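Notice that this formula makes intuitive sense: as $m$
increases, the mass of probability is shifted to higher
and higher $\epsilon_m$'s; but as $n$ increases, higher and higher
$m$'s are needed to make this happen to the same degree.
To see this, consider the binomial expansion

$$\begin{aligned}
B^m(e_m - 1 \mid n, \epsilon_m) &= [B(e_m \mid n, \epsilon_m) + b(e_m \mid n, \epsilon_m)]^m \\
&= B^m(e_m \mid n, \epsilon_m) + m\, B^{m-1}(e_m \mid n, \epsilon_m)\, b(e_m \mid n, \epsilon_m) \\
&\quad + \tfrac{1}{2} m(m-1)\, B^{m-2}(e_m \mid n, \epsilon_m)\, b^2(e_m \mid n, \epsilon_m) + \cdots
\end{aligned} \qquad (9)$$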

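and consider that, for all but the smallest sample sizes,
$B(e_m \mid n, \epsilon_m) \gg b(e_m \mid n, \epsilon_m)$. Thus:

$$\begin{aligned}
p(\epsilon_m \mid n, e_m) &\propto p(e_m \mid n, \epsilon_m) \\
&= B^m(e_m - 1 \mid n, \epsilon_m) - B^m(e_m \mid n, \epsilon_m) \\
&\approx m\, b(e_m \mid n, \epsilon_m)\, B^{m-1}(e_m \mid n, \epsilon_m)
\end{aligned} \qquad (10)$$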
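
The quality of this first-order approximation can be inspected directly. The sketch below compares the exact difference of powers with the leading term of the expansion; the function names and the sample values used in the check are illustrative assumptions.

```python
from math import comb

def b(e, n, eps):
    """Binomial probability of exactly e errors in n examples."""
    return comb(n, e) * eps**e * (1.0 - eps)**(n - e)

def B(e, n, eps):
    """Probability of more than e errors in n examples."""
    return sum(b(i, n, eps) for i in range(e + 1, n + 1))

def exact(e, n, eps, m):
    """Exact likelihood: B^m(e - 1) - B^m(e)."""
    return B(e - 1, n, eps)**m - B(e, n, eps)**m

def first_order(e, n, eps, m):
    """Leading term of the binomial expansion: m * b * B^(m - 1)."""
    return m * b(e, n, eps) * B(e, n, eps)**(m - 1)

def relative_gap(e, n, eps, m):
    """Fraction of the exact value that the leading term leaves out."""
    full = exact(e, n, eps, m)
    return (full - first_order(e, n, eps, m)) / full
```

Every dropped term in the expansion is non-negative, so the leading term can only underestimate the exact likelihood, and the two coincide when $m = 1$. The gap shrinks as $b$ becomes small relative to $B$.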

When $m = 1$, this reduces to $b(e_m \mid n, \epsilon_m)$, as expected.
When $m = 2$, $b(e_m \mid n, \epsilon_m)$ is multiplied by a constant
and by $B(e_m \mid n, \epsilon_m)$. Since the latter is a function
that increases monotonically with $\epsilon_m$ for a given $n$ and
$e_m$, the effect of this is to decrease the probability of
lower $\epsilon_m$'s and increase the probability of higher ones,
and thus to increase the expected $\epsilon_m$. As $m$ increases,
$b(e_m \mid n, \epsilon_m)$ is multiplied by higher and higher powers
of $B(e_m \mid n, \epsilon_m)$. This further decreases the probability
of low $\epsilon_m$'s and increases the probability of high
ones, leading to an ever-increasing expected $\epsilon_m$. As
an example, Figure 2 shows $b(25 \mid 50, \epsilon_m)$ (magnified
by a factor of five) and several powers of $B(25 \mid 50, \epsilon_m)$.
The resulting $p(\epsilon_m \mid 50, 25)$ (not shown) has a roughly
similar shape to $b(25 \mid 50, \epsilon_m)$, but shifts rightward in
step with $B^m(25 \mid 50, \epsilon_m)$. For larger $n$, the same process
takes place, but $b(e_m \mid n, \epsilon_m)$ is more sharply peaked,
$B(e_m \mid n, \epsilon_m)$ also transitions from values close to zero
to values close to one more sharply, and the advance
of $B^m(e_m \mid n, \epsilon_m)$ to the right becomes correspondingly
slower (since, for any $0 < k < y < 1$, as $y \to 1$ with
$k$ held constant higher and higher $m$'s are needed to
make $y^m < k$). This can be seen by comparing Figure 2
with Figure 3, which shows the corresponding
plots for $n = 500$.

Figure 2: Variation of $b(e_m \mid n, \epsilon_m)$ and powers of
$B(e_m \mid n, \epsilon_m)$ with $\epsilon_m$ for $n = 50$, $e_m = n/2$.

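Equation 8 still assumes that all $m$ hypotheses drawn
are independent, but it can be further generalized to
include the dependent case:

$$\begin{aligned}
p(\epsilon_m \mid n, e_m) &\propto p(e_m \mid n, \epsilon_m) \\
&= p(\forall_{1 \le i \le m}\; e_{m,i} \ge e_m \mid n, \epsilon_m) - p(\forall_{1 \le i \le m}\; e_{m,i} > e_m \mid n, \epsilon_m)
\end{aligned} \qquad (11)$$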

Evaluating this expression when high-order dependencies
are present will generally not be feasible, but
the standard Bayesian network approach (Heckerman,
1996) is applicable here: the number of training errors
$e_{m,i}$ of each hypothesis $h_{m,i}$ generated by $L_m$ can be
viewed as a node in a Bayesian network, whose parents
are the training errors of the hypotheses $h_{m,j}$ it is
primarily dependent on. For example, in many greedy
search processes (e.g., standard decision tree induction),
if $h_{m,3}$ was derived from $h_{m,2}$, which in turn
was derived from $h_{m,1}$, $e_{m,3}$ will be approximately independent
of $e_{m,1}$ given $e_{m,2}$. In general, the Bayesian
network for a given learning process will have the DAG
(directed acyclic graph) of the search process itself as
a subgraph (e.g., in a greedy search each node $e_{m,i}$
will have arcs to the training errors of the hypotheses
that were generated from $h_{m,i}$). If $par(e_{m,i})$ are the
parents of $e_{m,i}$ in the Bayesian network, Equation 11
above reduces to:

$$\begin{aligned}
p(\epsilon_m \mid n, e_m) \propto p(e_m \mid n, \epsilon_m) ={}& \prod_{i=1}^{m} p\big(e_{m,i} \ge e_m \mid n, \epsilon_m, \forall_{e_{m,j} \in par(e_{m,i})}\; e_{m,j} \ge e_m\big) \\
&- \prod_{i=1}^{m} p\big(e_{m,i} > e_m \mid n, \epsilon_m, \forall_{e_{m,j} \in par(e_{m,i})}\; e_{m,j} > e_m\big)
\end{aligned} \qquad (12)$$

Figure 3: Variation of $b(e_m \mid n, \epsilon_m)$ and powers of
$B(e_m \mid n, \epsilon_m)$ with $\epsilon_m$ for $n = 500$, $e_m = n/2$.

$L_1$ and $L_2$ above were considered to be different learners,
but they can equally well be considered different
stages of the same learner. For example, $L_2$ can take
the hypothesis output by $L_1$ as its own first hypothesis.
More generally, $L_m$ can be the result of continuing the
search of learner $L_k$ ($k < m$) with $m - k$ more hypotheses.
Thus this framework can be applied to problems
like decision tree and rule pruning, to which we now
turn.

3  AN  APPLICATION:  RULE 
INDUCTION 

Most  rule  induction  systems  employ  a  set  covering  or 
“separate and conquer” search strategy (Michalski,
1983; Clark & Niblett, 1989). Rules are induced one
at  a  time,  and  each  rule  starts  with  a  training  set  com¬ 
posed  of  the  examples  not  covered  by  any  previous 
rules.  A  rule  is  induced  by  adding  conditions  one  at  a 
time,  starting  with  none  (i.e.,  the  rule  initially  covers 
the  entire  instance  space).  The  next  condition  to  add 
is chosen by attempting all possible conditions. Conditions
on symbolic attributes are typically of the form
$a_i = v_{ij}$, where $v_{ij}$ is a possible value of attribute $a_i$.
Conditions on numeric attributes are typically of the
form $a_i < v_{ij}$ or $a_i > v_{ij}$, where the thresholds $v_{ij}$ are
usually values of the attribute that appear in the training
set. In the beam search process used by many rule
learners,  at  each  step  the  best  b  versions  of  the  rule 
according  to  some  evaluation  function  are  selected  for 
further  specialization.  AQ  (Michalski,  1983)  continues 
adding  conditions  until  the  rule  is  “pure”  (i.e.,  until  it 
covers  examples  of  only  one  class).  This  can  lead  to  se¬ 
vere  overfitting.  The  latest  version  of  the  CN2  system 
(Clark  &  Boswell,  1991)  uses  a  simple  and  effective 
Bayesian  method  to  combat  this;  induction  of  a  rule 
stops  when  no  specialization  improves  its  error  rate, 
and  the  latter  is  computed  using  a  Laplace  correction 
or m-estimate. If $n_r$ is the number of examples covered
by a rule $r$, and $e_r$ is the number of those examples
it misclassifies, the conventional estimate of the rule's
error rate is $e_r/n_r$, but its m-estimate is:

$$\hat{\epsilon}_r = \frac{e_r + m\, \epsilon_0}{n_r + m} \qquad (13)$$

where $\epsilon_0$ is the rule's a priori error, which CN2 takes to
be the error obtained by random guessing if all classes
are equally likely: $\epsilon_0 = (c - 1)/c$, where $c$ is the number
of classes. This prior value is given a weight of $m$
examples (i.e., the behavior of Equation 13 is equivalent
to having $m$ additional examples covered by the
rule, one of each class). CN2 uses $m = c$. As conditions
are added, the rule covers fewer and fewer examples,
and $\hat{\epsilon}_r$ tends to $\epsilon_0$. Thus a rule making more
misclassifications  may  be  preferred  if  it  covers  more 
examples,  causing  induction  to  stop  earlier  and  reduc¬ 
ing  overfitting.  Clark  and  Boswell  (Clark  &  Boswell, 
1991)  found  this  version  of  CN2  to  be  more  accurate 
than  C4.5  (Quinlan,  1993)  on  10  of  the  12  bench¬ 
mark  datasets  they  used  for  testing.  However,  this 
scheme  ignores  that,  as  more  and  more  conditions  are 
attempted,  the  probability  of  finding  one  that  appears 
to  reduce  the  rule’s  error  merely  by  chance  increases. 
This will lead the m-estimate to underestimate the
chosen condition's true error, and CN2 to overfit. The
upward correction made to $e_r$ should increase with the
number  of  conditions  attempted.  The  process-oriented 
evaluation  framework  described  in  the  previous  section 
allows  us  to  do  this  in  a  systematic  way. 
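
The m-estimate described above can be illustrated with a short sketch. It assumes the standard form of the estimate, in which a prior error of $(c-1)/c$ is given the weight of $m$ examples; the function name and argument names are hypothetical and this is not CN2's source code.

```python
def m_estimate(errors, covered, num_classes, m=None):
    """Return the m-estimate of a rule's error rate.

    errors      -- training examples the rule misclassifies
    covered     -- training examples the rule covers
    num_classes -- number of classes c; the prior error is (c - 1) / c
    m           -- weight of the prior in examples (defaults to c, as in CN2)
    """
    if m is None:
        m = num_classes
    prior = (num_classes - 1) / num_classes
    return (errors + m * prior) / (covered + m)

# A "pure" rule covering 2 examples versus a rule with 2 errors in 20 examples.
pure_but_small = m_estimate(errors=0, covered=2, num_classes=2)
larger_with_errors = m_estimate(errors=2, covered=20, num_classes=2)
```

In this two-class example the small pure rule receives an estimate of 0.25, while the larger rule receives 3/22, so the rule with more misclassifications is preferred. Note that the correction depends only on coverage, not on how many candidate conditions were tried, which is the gap that process-oriented evaluation addresses.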

Let each hypothesis be one version of the rule attempted
during the beam search. The main change
to Equation 8 required is to take into account that
different versions of a rule will cover different numbers
of training examples. In other words, $n$ is now a function
of the hypothesis, and the hypothesis with lowest
$e_i/n_i$ is chosen. Let $\mathbf{n}_m = (n_1, \ldots, n_i, \ldots, n_m)$, where
$n_i$ is the number of examples covered by rule version $i$,
and let $e_m = \min_{1 \le i \le m} \{e_i/n_i\}$ be the lowest training-set
error rate found so far. Equation 8 becomes:

$$p(\epsilon_m \mid \mathbf{n}_m, e_m) \propto p(e_m \mid \mathbf{n}_m, \epsilon_m) = \prod_{i=1}^{m} B(n_i e_m - 1 \mid n_i, \epsilon_m) - \prod_{i=1}^{m} B(n_i e_m \mid n_i, \epsilon_m) \qquad (14)$$

This  equation  does  not  need  to  be  computed  for  ev¬ 
ery  rule  version  generated  during  the  beam  search,  but 
only  once  for  each  round.  One  round  consists  of  gen¬ 
erating  every  possible  one-step  specialization  of  each 
rule  version  in  the  beam,  and  selecting  the  b  best. 
Thus, if there are $a$ attributes and $v$ is the maximum
number of values of any attribute (in the worst case,
$v = n$ for numeric attributes), one round corresponds
to $O(bav)$ rule versions. Let $m_k$ be the total number
of rule versions generated up to, and including,
round $k$. Round 1 consists of the initial rule with
no conditions, and $m_1 = 1$. Induction stops when
^  i  f®'^  fc  >  1. 

Equation  14  is  of  course  only  a  first  approximation. 
Many  other  aspects  of  the  rule  induction  process  can 
be  taken  into  account  using  Equation  12,  and  making 
approximations  as  needed  for  computational  efficiency. 
A  version  of  CN2  that  takes  into  account  the  depen¬ 
dence  between  each  rule  version  and  its  parent  (i.e., 
the  rule  version  it  specializes  by  one  condition)  is  cur¬ 
rently  being  implemented. 

4  EMPIRICAL  STUDY 

In order to test the effectiveness of process-oriented
evaluation, default and process-oriented versions of
CN2 were compared on the benchmark datasets previously
used by Clark and Boswell (1991).^2 The process-oriented
version was implemented by adding the necessary
facilities to the CN2 source code. Numerical integration
(Equation 7) was performed using Simpson's rule, and
B(e|n,e) (Equation 2) was computed using the incomplete
beta function (Press, Teukolsky, Vetterling & Flannery,
1992).

^2 With the exception of pole-and-cart, which is not
available in the UCI repository (Merz, Murphy & Aha, 1997).

Integrating Equation 14 every time the estimate needs to
be computed (once per round) would generally slow down
the rule induction process significantly. Instead, it was
approximated by:


p(em|n,em)  ocp(e„|n,em)  = 

jB"*(nem-l|n,em)  -  (15) 

where n= ^ This replaces each of the products with a
single-step computation, speeding up evaluation by a
factor of O(m). CN2's Laplace estimates are still used
to choose the best b specializations in each round. This
is preferable to using uncorrected estimates, since as
implemented POE has no preference between hypotheses
within the same round, and this is also a factor in
avoiding overfitting. However, the Laplace correction
distorts the values used by Equation 15. This will be
particularly pronounced when there are many classes,
since CN2 uses m = c. In order to minimize this
problem, m = 2 was used with POE.^3

The experimental procedure of (Clark & Boswell,
1991) was followed. Each dataset was randomly
divided into 67% for training and 33% for testing, and
the error rate and theory size (total number of
conditions) were measured for default CN2 and CN2-POE.
This was repeated 20 times. The average results and
their standard deviations are shown in Table 1.^4

POE  reduces  CN2’s  error  rate  in  8  of  the  11  datasets. 
Using  a  sign  test,  these  results  are  significant  at  the 
4%  level.  In  other  words,  POE  improves  CN2  with 
high  confidence.  It  also  produces  simpler  rule  sets  in 
all  but  two  of  the  datasets.  With  the  approximation 
used,  POE  did  not  noticeably  increase  CN2’s  running 
time.  This  is  also  due  to  the  fact  that  POE  tends  to 
make  induction  stop  sooner  than  in  default  CN2,  as 
evinced  by  the  theory  size  results. 

^3 Simply changing m = c to m = 2 in default CN2 does
not change its performance on the datasets used.

^4 There are some differences between CN2's results and
those reported in (Clark & Boswell, 1991). This may be
due to the fact that the default version of CN2 uses a beam
size of 5, whereas Clark and Boswell used b = 20. The
distribution version of CN2 may also differ from the one
used in (Clark & Boswell, 1991).

While these results are encouraging, they do not
necessarily prove that CN2-POE reduces overfitting by
taking into account the increasing number of rule
versions generated as search progresses. If this is indeed
what is taking place, the difference in error between
default CN2 and CN2-POE ($error_{CN2} - error_{CN2\text{-}POE}$)
should increase with the dataset's number of
attributes, since this will increase the number of rule
versions  generated  in  each  round.  In  order  to  test  this 
hypothesis,  experiments  were  carried  out  in  artificial 
domains.  Concepts  defined  as  Boolean  functions  in 
disjunctive  normal  form  were  used  as  targets.  The 
datasets  were  composed  of  100  training  examples  and 
1000  test  examples  described  by  a  variable  number  of 
attributes a. The number of literals d in each
disjunct was generated at random, with a mean of
$\bar{d} = 5$ and a variance of $5 \times (1 - 5/a)$.
This is obtained by including each literal in the
disjunct with probability $5/a$. Literals were negated
or not with equal probability. The number of disjuncts
was set to $2^{\bar{d}-1} = 16$, which ensures the
concept covers roughly half the instance space. Equal
numbers of positive and negative examples were included
in the dataset, and positive examples were divided
evenly among disjuncts. In each run a different target
concept was used. One hundred runs were conducted for
each value of a between 10 and 100 (at intervals of 5),
and the correlation between
($error_{CN2} - error_{CN2\text{-}POE}$) and a was
measured. This was found to be highly positive
($\rho = 0.66$), confirming our hypothesis.
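The generation of the artificial target concepts can be sketched as follows. This is only an illustrative reading of the description above, not the code used in the study: the function names, the dictionary representation of a disjunct, and the treatment of an empty disjunct (which is kept and covers every instance) are assumptions.

```python
import random


def random_dnf(num_attributes, num_disjuncts=16, mean_literals=5, rng=None):
    """Return a random DNF concept as a list of disjuncts.

    Each disjunct is a dict mapping an attribute index to the Boolean value
    it requires (False stands for a negated literal).
    """
    rng = rng or random.Random()
    p = mean_literals / num_attributes  # inclusion probability 5/a
    concept = []
    for _ in range(num_disjuncts):
        disjunct = {}
        for attr in range(num_attributes):
            if rng.random() < p:                     # include this literal
                disjunct[attr] = rng.random() < 0.5  # negate with prob. 1/2
        concept.append(disjunct)
    return concept


def covers(concept, example):
    """True if at least one disjunct is satisfied by the Boolean example."""
    return any(all(example[i] == v for i, v in d.items()) for d in concept)


# One target concept over a = 20 attributes.
concept = random_dnf(num_attributes=20, rng=random.Random(0))
```

With inclusion probability 5/a, the number of literals per disjunct is binomial with mean 5 and variance 5(1 - 5/a), which matches the figures quoted in the text.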

5  RELATED  WORK 

The  literature  on  model  selection  and  error  estimation 
is  very  large,  and  we  will  not  attempt  to  review  it 
here.  The  incompleteness  of  representation-oriented 
evaluation  was  noted  20  years  ago  by  Pearl  (1978): 

It would, therefore, be more appropriate to
connect credibility with the nature of the
selection procedure rather than with properties
of the final product. When the former is not
explicitly known ... simplicity merely serves
as a rough indicator for the type of processing
that took place prior to discovery.

Huber  (St.  Amant  &  Cohen,  1997;  Huber,  1994)  ex¬ 
presses  thus  the  need  for  process-oriented  evaluation: 

Data  analysis  is  different  from,  for  exam¬ 
ple,  word  processing  and  batch  programming: 
the  correctness  of  the  end  product  cannot  be 
checked  without  inspecting  the  path  leading 
to  it. 

Table 1: Empirical results: error rates and theory sizes of default CN2 and CN2 with process-oriented evaluation
(CN2-POE).

              Error rate               Theory size
Dataset       CN2         CN2-POE      CN2          CN2-POE
Breast        30.0±1.4    29.7±1.4     114.5±2.4    58.7±2.6
Echocardio    32.7±1.2    32.3±1.3     42.9±1.2     35.4±2.1
Glass         39.0±1.5    38.3±1.7     51.8±1.0     54.7±1.1
HeartC        20.8±0.8    22.5±0.8     57.8±0.9     52.0±1.0
HeartH        22.4±1.1    21.8±1.3     69.2±1.5     60.3±1.4
Hepatitis     21.2±0.9    19.2±1.3     40.2±1.7     34.0±1.3
Lympho        21.4±1.1    24.1±1.1     39.5±0.7     38.7±1.0
Soybean       19.5±1.0    19.4±1.0     116.7±2.3    110.9±3.1
Thyroid       4.1±0.2     3.8±0.2      97.5±2.0     104.8±2.0
Tumor         60.1±1.0    65.1±1.3     302.8±4.6    273.9±4.4
Voting        4.8±0.4     4.3±0.3      61.7±2.9     49.6±2.5

Several pieces of previous work take into account the
number of hypotheses being compared, and so can be
considered early steps towards process-oriented
evaluation. This includes notably systems that use the
Bonferroni correction when testing significance (e.g.,
Kass, 1980; Gaines, 1989; Jensen & Schmill, 1997;
see also Miller, 1981; Klockars & Sax, 1986; Westfall
& Wolfinger, 1997). A key difference between these
systems and what is proposed here is that they require
a somewhat arbitrary choice of significance threshold,
while this paper directly attempts to optimize the end
goal (expected generalization error). Also, the
Bonferroni correction does not take hypothesis
dependencies into account, while the present framework
offers (at least in principle) a way of doing so.

Quinlan  and  Cameron-Jones’s  (1995)  “layered  search” 
method  for  automatically  selecting  CN2’s  beam  width 
can  also  be  considered  a  form  of  process-oriented  eval¬ 
uation.  While  layered  search  and  CN2-POE  have  sim¬ 
ilar  aims,  their  biases  differ:  layered  search  limits  the 
search’s  width,  while  CN2-POE  limits  its  length.  The 
latter  may  be  more  effective  in  reducing  the  fragmenta¬ 
tion  and  small  disjuncts  problems  (Pagallo  &  Haussler, 
1990;  Holte,  Acker  &  Porter,  1989).  The  assumptions 
made  by  the  heuristic  proposed  here  are  also  clearer 
than  those  implicit  in  Quinlan  and  Cameron-Jones’s 
measure. 

Evaluating  models  that  are  the  result  of  a  search 
process,  not  just  of  fitting  the  parameters  of  a  pre¬ 
determined  structure,  has  traditionally  not  been  a  con¬ 
cern  of  statisticians.  However,  this  is  beginning  to 
change  (Chatfield,  1995). 

Some of the arguments made here for taking into
account the number of hypotheses attempted are made
in greater detail in (Cohen & Jensen, 1997) and (Ng,
1997). The present paper goes further in arguing that
other aspects of the search process should also be taken
into account whenever possible (for example, in rule
induction, the number of examples covered by each
hypothesis).

6  FUTURE  WORK 

The  development  and  evaluation  contained  in  this  pa¬ 
per  are  obviously  only  preliminary.  As  mentioned 
above,  a  version  of  CN2-POE  that  takes  hypothesis 
dependencies  into  account  is  currently  being  imple¬ 
mented.  Applications  of  POE  to  decision  tree  in¬ 
duction,  backpropagation,  instance  selection,  feature 
selection  and  discretization  are  also  areas  for  future 
work.  In  each  case,  the  main  issue  is  likely  to  be  find¬ 
ing  the  optimal  trade-off  between  the  computational 
and  mathematical  complexity  of  POE  and  its  payoff 
in  reduced  error  rates.  The  success  of  the  enterprise  is 
likely  to  hinge  on  distinguishing  strong  dependencies 
from  weak  ones  that  can  be  ignored,  and  on  finding  ef¬ 
ficient  but  roughly  correct  approximations.  For  most 
learners  in  most  domains,  it  is  probably  not  realis¬ 
tic  to  expect  large  error  reductions  from  POE,  since 
it  does  not  change  the  underlying  representation  or 
search  process.  However,  if  POE’s  gains  are  small  but 
consistent  across  a  broad  spectrum  of  learners  and  do¬ 
mains,  it  will  still  be  worth  developing. 

The  POE  error  estimates  introduced  in  this  paper  have 
two  types  of  statistical  bias.  One  stems  from  the  fact 
that,  because  evaluation  focuses  on  the  lowest  error 
found,  low  outliers  have  a  stronger  effect  than  high 
ones,  leading  to  a  negative  bias  (i.e.,  underestimating 
error).  This  bias  can  be  estimated  and  the  POE  val¬ 
ues  corrected.  This  is  an  area  of  current  work.  The 
second source of bias is the assumption that all
hypotheses tried by the learner have similar error rates.
This  will  lead  to  a  positive  bias  when  the  error  rate 
is  decreasing  (i.e.,  POE  will  tend  to  overestimate  er¬ 
ror  at  least  up  to  the  point  where  the  learner  starts 
overfitting).  One  way  to  overcome  this  is  to  intro¬ 
duce  explicit  expectations  about  the  evolution  of  the 
learner’s  error  as  search  progresses.  For  example,  a 
specific  type  of  curve  may  be  assumed,  or  an  “expected 
curve”  can  be  compiled  by  cross-validation.  Another 
approach  is  to  avoid  the  assumption  of  similar  error 
rates,  for  example  by  marginalizing  over  the  true  error 
rates  of  all  hypotheses  but  the  chosen  one,  or  by  us¬ 
ing  their  maximum-likelihood  estimates.  Both  of  these 
approaches  are  also  currently  being  studied. 

The  ultimate  goal  of  POE  is  to  accurately  predict  a 
hypothesis’s  generalization  error  from  its  training-set 
error,  using  knowledge  of  how  the  hypothesis  was  ob¬ 
tained.  How  far  this  is  possible  remains  an  open  ques¬ 
tion. 

7  CONCLUSION 

Two  main  types  of  model  selection  are  currently  avail¬ 
able.  In  data-oriented  evaluation,  a  hypothesis’s  score 
does  not  depend  on  its  form  or  how  the  hypothe¬ 
sis  was  found,  but  only  on  its  performance  on  the 
data.  In  representation-oriented  evaluation,  the  score 
depends  on  the  data  and  on  the  hypothesis’s  form, 
but  not  on  the  search  process  that  led  to  it.  This  pa¬ 
per  argued  that  the  latter  cannot  be  ignored,  and  pro¬ 
posed  process-oriented  evaluation  (POE),  which  takes 
all  three  factors  into  account.  An  application  of  POE 
to  the  CN2  rule  induction  system  was  found  to  reduce 
error  in  8  of  11  benchmark  datasets,  and  produce  sim¬ 
pler  theories  in  9.  Experiments  in  artificial  domains 
support  the  hypothesis  that  these  gains  stem  at  least 
partly  from  CN2-POE’s  use  of  search  process  informa¬ 
tion. 
Abstract

Relational  reinforcement  learning  is  pre¬ 
sented,  a  learning  technique  that  combines 
reinforcement  learning  with  relational  learn¬ 
ing  or  inductive  logic  programming.  Due  to 
the  use  of  a  more  expressive  representation 
language  to  represent  states,  actions  and  Q- 
functions,  relational  reinforcement  learning 
can  be  potentially  applied  to  a  new  range  of 
learning  tasks.  One  such  task  that  we  inves¬ 
tigate  is  planning  in  the  blocks  world,  where 
it  is  assumed  that  the  effects  of  the  actions 
are  unknown  to  the  agent  and  the  agent  has 
to  learn  a  policy.  Within  this  simple  domain 
we  show  that  relational  reinforcement  learn¬ 
ing  solves  some  existing  problems  with  rein¬ 
forcement  learning.  In  particular,  relational 
reinforcement  learning  allows  us  to  employ 
structural  representations,  make  abstraction 
of  specific  goals  pursued  and  exploit  the  re¬ 
sults  of  previous  learning  phases  when  ad¬ 
dressing  new  (more  complex)  situations. 

1  INTRODUCTION 

Within the field of machine learning, both reinforcement
learning [8] and inductive logic programming (or
relational learning) [12, 10] have received a lot of
attention since the early nineties. It is therefore no
surprise that both Leslie Pack Kaelbling and Richard Sutton
(in their invited talks at IJCAI-97, Nagoya, Japan)
suggested studying the combination of these two fields.

From  the  reinforcement  learning  point  of  view,  this 
could  significantly  extend  the  application  perspective. 
Most  representations  used  in  reinforcement  learning 
are  inadequate  for  describing  planning  tasks  such  as 
the  simple  blocks  world.  Even  reinforcement  learning 


work  that  involves  generalization  has  largely  employed 
an  attribute-value  representation.  Furthermore,  due 
to  the  use  of  variables  in  relational  representations,  it 
is  possible  to  make  abstractions  of  some  specific  details 
of  the  learning  tasks,  such  as  the  goal  pursued.  Indeed, 
when  learning  to  plan  in  the  blocks  world,  one  would 
expect  that  the  results  of  learning  how  to  stack  block 
a  onto  block  b  would  be  similar  to  stacking  c  onto  d. 
Current approaches to reinforcement learning have to
retrain from scratch if the goal is changed in this
manner. With relational reinforcement learning, retraining
is unnecessary. Relational reinforcement learning also
allows  us  to  exploit  the  results  of  learning  in  a  simple 
domain  when  learning  in  a  more  complex  domain  (e.g., 
going  from  3  blocks  to  4  blocks  in  the  blocks  world). 

From  the  inductive  logic  programming  point  of  view, 
it  is  important  to  address  domains  such  as  reinforce¬ 
ment  learning.  So  far,  inductive  logic  programming 
has  mainly  studied  concept-learning,  and  largely  ig¬ 
nored  the  rest  of  machine  learning.  By  demonstrating 
the  potential  of  relational  representations  for  reinforce¬ 
ment  learning,  we  hope  to  show  that  the  relational 
learning  methodology  does  not  only  apply  to  concept¬ 
learning  but  to  the  whole  field  of  machine  learning. 

With  this  in  mind,  we  present  a  preliminary  ap¬ 
proach  to  relational  reinforcement  learning  and  ap¬ 
ply  it  to  simple  planning  tasks  in  the  blocks  world. 
The  planning  task  involves  learning  a  policy  to  select 
actions.  Learning  is  necessary  as  the  planning  agent 
does  not  know  the  effects  of  its  actions.  Relational  re¬ 
inforcement  learning  employs  the  Q-learning  method 
[14,  8,  11]  where  the  Q-function  is  learned  using  a  re¬ 
lational  regression  tree  algorithm  (see  [6,  9]).  A  state 
is  represented  relationally  as  a  set  of  ground  facts.  A 
relational  regression  tree  in  this  context  takes  as  input 
a  relational  description  of  a  state,  a  goal  and  an  action, 
and  produces  the  corresponding  Q-value. 
This paper is organized as follows. In section 2, we
view planning (under uncertainty) as a reinforcement
learning task, and in section 3, we briefly review
reinforcement learning, and in particular Q-learning.
Section 4 introduces relational reinforcement learning,
which combines Q-learning and logical regression trees.
In section 5, we present some experiments, and finally, in
section 6, we conclude and touch upon related work.

2 LEARNING TO PLAN AS REINFORCEMENT LEARNING

Consider a planning agent with the following task:
Given

• a set of possible states $S$,

• a set of possible actions $A$,

• an UNKNOWN function $\delta : S \times A \to S$,

• a function $pre : S \times A \to \{t, f\}$,

• a goal $goal : S \to \{t, f\}$, and

• a starting state $s \in S$,

find a sequence of actions $a_1, \ldots, a_n$ ($a_i \in A$) such that

• $goal(\delta(\ldots\delta(s, a_1)\ldots, a_n)) = t$, and

• $pre(\delta(\ldots\delta(s, a_1)\ldots, a_{i-1}), a_i) = t$.

The agent can be in one of the states of $S$. It can
execute action $a \in A$ in a given state $s$ if the
preconditions for $a$ are true in $s$ ($pre(s, a) = t$),
e.g., as in STRIPS [7]. Executing an action $a$ in a
state $s$ will put the agent in a new state
$\delta(s, a)$. When placed in a state $s$, the task of
the agent is to find a (shortest) sequence of actions
$a_1, \ldots, a_n$ that will lead it to a goal state.
The prototypical AI task belonging to this category is
planning.

It is assumed here that the agent does not know the
effect of its actions, hence the function $\delta$ is
unknown to the agent. The above task specification thus
contrasts with classical planning, where $\delta$ is
known. Therefore, this task requires a learning
component.

Example: The best known (toy) domain to study
planning is the blocks world. Consider the situation
where we have three blocks called a, b and c, and the
floor. Blocks can be on the floor or can be stacked on
each other. Each state can be described by a set (list)
of facts, e.g.,
$s_1 = \{clear(a), on(a,b), on(b,c), on(c,floor)\}$.
The available actions are then $move(x, y)$ where
$x \neq y$ and $x \in \{a, b, c\}$,
$y \in \{a, b, c, floor\}$.

It is then possible to define the preconditions and
effects of actions. The Prolog code below defines pre
and $\delta$ respectively. The predicate pre defines the
preconditions for the action move(X,Y) while the
predicate delta defines its effects: delta(S, A, S1)
succeeds when $\delta(S, A) = S1$. States are represented
as lists of facts and the auxiliary predicate
holds(S, Query) succeeds when Query would succeed in the
knowledge base containing the facts in S only.

% Preconditions of move(X,Y).
pre(S, move(X,Y)) :-
    holds(S, [clear(X), clear(Y), not X=Y, not on(X,floor)]).
pre(S, move(X,Y)) :-
    holds(S, [clear(X), clear(Y), not X=Y, on(X,floor)]).
pre(S, move(X,floor)) :-
    holds(S, [clear(X), not on(X,floor)]).

% holds(S, Query): every literal in Query is true in state S.
holds(S, []).
holds(S, [not X=Y | R]) :-
    not X=Y, !, holds(S, R).
holds(S, [not A | R]) :-
    not member(A, S), holds(S, R).
holds(S, [A | R]) :-
    member(A, S), holds(S, R).

% Effects of move(X,Y): NextS is the state reached from S.
delta(S, move(X,Y), NextS) :-
    holds(S, [clear(X), clear(Y), not X=Y, not on(X,floor)]),
    delete([clear(Y), on(X,Z)], S, S1),
    add([clear(Z), on(X,Y)], S1, NextS).
delta(S, move(X,Y), NextS) :-
    holds(S, [clear(X), clear(Y), not X=Y, on(X,floor)]),
    delete([clear(Y), on(X,floor)], S, S1),
    add([on(X,Y)], S1, NextS).
delta(S, move(X,floor), NextS) :-
    holds(S, [clear(X), not on(X,floor)]),
    delete([on(X,Z)], S, S1),
    add([clear(Z), on(X,floor)], S1, NextS).

The goal is to stack a onto b, i.e.,
goal(S) :- member(on(a,b), S). □
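For readers who want to experiment outside Prolog, the same blocks-world dynamics can be sketched in Python. This is an illustrative transliteration under stated assumptions, not part of the original system: states are frozensets of fact tuples, actions are (x, y) pairs standing for move(x, y), and the names FLOOR, pre, delta and goal are chosen here to mirror the Prolog predicates.

```python
FLOOR = "floor"


def pre(state, action):
    """Preconditions of move(x, y): x is a clear block, y is clear or the floor."""
    x, y = action
    if x == y or x == FLOOR or ("clear", x) not in state:
        return False
    if y == FLOOR:
        return ("on", x, FLOOR) not in state  # x must not already be on the floor
    return ("clear", y) in state


def delta(state, action):
    """Return the successor state reached by executing move(x, y)."""
    if not pre(state, action):
        raise ValueError("preconditions not satisfied")
    x, y = action
    below = next(f[2] for f in state if f[0] == "on" and f[1] == x)
    new = set(state)
    new.remove(("on", x, below))
    new.add(("on", x, y))
    if below != FLOOR:
        new.add(("clear", below))   # the block x was resting on becomes clear
    if y != FLOOR:
        new.remove(("clear", y))    # the destination block is now covered
    return frozenset(new)


def goal(state):
    """The goal used in the example: a is stacked on b."""
    return ("on", "a", "b") in state


# The state s1 from the example: a on b on c on the floor.
s1 = frozenset({("clear", "a"), ("on", "a", "b"),
                ("on", "b", "c"), ("on", "c", FLOOR)})
s2 = delta(s1, ("a", FLOOR))
```

Moving a to the floor from s1 uncovers b, and moving a back onto b restores s1.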


3  REINFORCEMENT  LEARNING 

Planning  with  incomplete  knowledge  as  outlined  above 
can  be  recast  as  a  reinforcement  learning  problem. 
3.1 THE BASICS OF REINFORCEMENT LEARNING

The basic notions of reinforcement learning can be
outlined as follows (we follow the notation used by
Mitchell [11]).

• The task of the agent is to learn a policy
$\pi : S \to A$ for selecting its next action $a_t$
based on the current state $s_t$, that is,
$\pi(s_t) = a_t$.

• The reward at time $t$ is $r_t = r(s_t, a_t)$. We will
assume here that $r_t = 1$ if
$goal(\delta(s_t, a_t)) = t$ and
$s_t \neq \delta(s_t, a_t)$; otherwise $r_t = 0$. The
reward function $r$ is unknown to the learner as it
relies on the unknown $\delta$. The reward function only
gives a reward in goal states.

• The state at time $t+1$ is
$s_{t+1} = \delta(s_t, a_t)$ if $goal(s_t) = f$;
otherwise $s_{t+1} = s_t$. This captures the idea that
goal states are absorbing states, i.e., once a goal
state is reached the only available action is to stay in
the state.

• The learned policy should be optimal, i.e., it
should maximize

$$V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$

where $0 \le \gamma < 1$. We will denote the optimal
policy by $\pi^*$.

The optimal policy $\pi^*$ allows us to compute the
shortest plan to reach a goal state. So, learning the
optimal policy (or approximations thereof) will allow us
to improve our planning performance.
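The link between the discounted sum of rewards and plan length can be made concrete with a short sketch. It assumes the reward scheme described above (reward 1 on the step that reaches the goal, 0 elsewhere) and a discount factor of 0.9; the function name is illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**i * r_i over a reward sequence r_0, r_1, ..."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))


# A plan that reaches the goal in two steps versus one that needs three.
short_plan = discounted_return([0, 1])
long_plan = discounted_return([0, 0, 1])
```

A plan of length n earns gamma^(n-1) under these assumptions, so a policy that maximizes the discounted return also prefers shorter plans.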

3.2  Q-LEARNING 

It is well known that under the conditions sketched in
the previous subsection, Q-learning allows us to
approximate the optimal policy.

The optimal policy $\pi^*$ will always select the action
that maximizes the sum of the immediate reward and
the (discounted) value of the immediate successor state,
i.e.,

$$\pi^*(s) = \arg\max_a \left( r(s, a) + \gamma V^{\pi^*}(\delta(s, a)) \right)$$

The problem with this formulation of $\pi^*$ is that it
requires knowledge of $\delta$ and $r$, which the learner
does not have at its disposal.

The Q-function is defined as follows:

$$Q(s, a) = r(s, a) + \gamma V^{\pi^*}(\delta(s, a))$$

Knowing $Q$ allows us to rewrite the definition of
$\pi^*$ as follows:

$$\pi^*(s) = \arg\max_a Q(s, a)$$

According to Mitchell, this rewrite is important as it
shows that if the agent can learn the $Q$ function
instead of the $V^{\pi^*}$ function, it will be able to
act optimally. The Q-function for a fixed goal can then
be approximated by $\hat{Q}$, for which a look-up table
is learned by the following algorithm (cf. [11]).

for each s, a do
    initialize the table entry $\hat{Q}(s, a) := 0$
do forever
    i := 0
    generate a random state $s_0$
    while not $goal(s_i)$ do
        select an action $a_i$ and execute it
        receive an immediate reward $r_i = r(s_i, a_i)$
        observe the new state $s_{i+1}$
        i := i + 1
    for j = i-1 to 0 do
        update $\hat{Q}(s_j, a_j) := r_j + \gamma \max_{a'} \hat{Q}(s_{j+1}, a')$
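The table-based algorithm can be sketched in Python as follows. Several things here are assumptions made for illustration only: a fixed number of episodes replaces "do forever", actions are chosen uniformly at random, and the environment is a four-state chain rather than the blocks world. The backward sweep over the episode is the part taken from the algorithm above.

```python
import random
from collections import defaultdict


def q_learning(states, actions, delta, goal, episodes=200, gamma=0.9, seed=0):
    """Tabular Q-learning with a backward update sweep after each episode."""
    rng = random.Random(seed)
    q = defaultdict(float)  # every table entry starts at 0
    for _ in range(episodes):
        s = rng.choice(states)  # random start state
        trace = []
        while not goal(s):
            a = rng.choice(actions)             # uniform exploration (assumption)
            s_next = delta(s, a)
            r = 1.0 if goal(s_next) else 0.0    # reward only on reaching the goal
            trace.append((s, a, r, s_next))
            s = s_next
        # Update the visited pairs from the end of the episode backwards.
        for s_j, a_j, r_j, s_next in reversed(trace):
            q[(s_j, a_j)] = r_j + gamma * max(q[(s_next, b)] for b in actions)
    return q


# Toy deterministic chain 0 - 1 - 2 - 3, where state 3 is the goal.
chain_states = [0, 1, 2, 3]
chain_actions = ["left", "right"]


def chain_delta(s, a):
    return max(s - 1, 0) if a == "left" else min(s + 1, 3)


def chain_goal(s):
    return s == 3


q_table = q_learning(chain_states, chain_actions, chain_delta, chain_goal)
```

Because the sweep runs backwards, the reward obtained at the end of an episode is propagated to every earlier state-action pair of that episode in a single pass.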

It is common in Q-learning to select action $a$ in state
$s$ probabilistically, so that actions with higher
$\hat{Q}(s, a)$ are assigned higher probabilities, e.g.,

$$P(a_i \mid s) = \frac{k^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}} \qquad (1)$$

Higher values of $k$ give stronger preference to actions
with high values of $\hat{Q}$, causing the agent to
exploit what it has learned, while lower values of $k$
reduce this preference, allowing the agent to explore
actions that currently do not have high values of
$\hat{Q}$.
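The effect of the constant k can be illustrated with a small sketch. It assumes the selection rule given in Mitchell's textbook, P(a_i | s) = k^Q(s, a_i) / sum_j k^Q(s, a_j), with k > 0; the function names and the example Q-values are illustrative.

```python
import random


def action_probabilities(q_values, k):
    """Map each action to k**Q(s, a), normalized over all actions (k > 0)."""
    weights = {a: k ** q for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def select_action(q_values, k, rng=random):
    """Draw one action according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]


# Q-value estimates for three actions in some state (made-up numbers).
q_in_state = {"move(a,b)": 1.0, "move(b,c)": 0.9, "move(c,floor)": 0.81}
exploring = action_probabilities(q_in_state, k=1.0)     # uniform choice
exploiting = action_probabilities(q_in_state, k=100.0)  # strongly greedy
```

With k = 1 every action is equally likely, while larger k concentrates the probability mass on the action with the highest estimate.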

4 RELATIONAL REINFORCEMENT LEARNING

4.1 THE NEED FOR RELATIONAL REPRESENTATIONS

Given  the  above  classical  framework  for  Q-learning  we 
could  now  learn  to  plan  in  the  blocks  world  sketched 
earlier.  Using  the  approach  as  it  stands  we  could 
store  all  the  state-action  pairs  encountered  and  mem¬ 
orize/update  the  corresponding  Q  values,  having  in 
effect  an  explicit  look-up  table  for  state-action  pairs. 
This  has  however  a  number  of  disadvantages: 
Figure 1: A blocks-world example for relational Q-learning. [Arrow labels: move(c,floor), move(b,c).]


• It is impractical for all but the smallest state
spaces. Furthermore, using look-up tables does
not work for infinite state spaces, which could arise
when first order representations are used (e.g., if
the number of blocks in the world is unknown or
infinite the above method does not work).

• Despite the use of a relational representation for
states and actions, the above method is unable
to capture the structural aspects of the planning
task.

• Whenever the goal is changed from, say, on(a, b) to
on(b, c), the above method would require retraining
the whole Q function.

• Ideally, one would expect that the results of
learning in a world with 3 blocks could be (partly)
recycled when learning in a 4 blocks world later on.
It is unclear how to achieve this with the lookup
table.

The  first  problem  can  be  solved  by  using  an  inductive 
learning  algorithm  (e.g.,  a  neural  network)  to  approx¬ 
imate  Q.  The  three  other  problems  can  only  be  solved 
by  using  a  relational  learning  algorithm  that  can  make 
abstraction  of  the  specific  blocks  and  goals  using  vari¬ 
ables.  We  now  present  such  a  relational  learning  algo¬ 
rithm. 

4.2  THE  RRL  ALGORITHM 

The  relational  reinforcement  learning  (RRL)  algo¬ 
rithm  is  obtained  by  combining  the  classical  Q- 
learning  algorithm  with  stochastic  selection  of  actions 
and  a  relational  regression  algorithm.  Instead  of  hav¬ 
ing  an  explicit  lookup  table,  an  implicit  representation 
of  the  Q-function  is  learned  in  the  form  of  a  logical  re¬ 
gression  tree,  called  a  Q-tree. 


The main point where RRL differs from the algorithm
in section 3.2 is in the for-loop where the Q function
is modified. This for-loop now becomes:

for j = i-1 to 0 do
    generate example $(s_j, a_j, q_j)$,
    where $q_j := r_j + \gamma \max_{a'} \hat{Q}_e(s_{j+1}, a')$

update $\hat{Q}_e$ using TILDE-RT
to produce $\hat{Q}_{e+1}$ using the examples $(s_j, a_j, q_j)$

TILDE-RT  [6]  is  an  algorithm  for  learning  logical  re¬ 
gression  trees  and  will  be  described  briefly  below. 

The initial tree $\hat{Q}_0$ assigns zero value to all
state-action pairs. From each goal state g encountered,
an example (g, a, 0) is generated for each action a whose
preconditions are satisfied in g. The rationale for this
is that no reward can be expected from applying an action
in an absorbing goal state.

Example: A possible initial episode (e = 0) in the
blocks world with three blocks a, b, and c, where the
goal is to stack a on b (i.e., goal(on(a, b))), is
depicted in Figure 1. The discount factor $\gamma$ is 0.9
and the reward given is one on achieving a goal state,
zero otherwise.

The examples generated by RRL use the actions and
the Q-values listed above the arrows representing the
actions. The actual format of these examples is listed
in Table 1. It is exactly this input that would be used
by TILDE-RT to generate the Q-tree $\hat{Q}_1$. □

TILDE-RT is not incremental, so we currently simulate
the update of $\hat{Q}$ by keeping all (s, a) pairs
encountered and the most recent q value for each pair,
and inducing a relational regression tree $\hat{Q}_e$
from these examples after each episode e. This tree is
then used to select actions in episode e + 1.
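The generation of training examples from one episode can be sketched as follows. This is an interpretation, not the authors' implementation: to reproduce the values 1.0, 0.9 and 0.81 of the initial episode, the sketch assumes that the maximization consults the most recent q-value stored for a state-action pair and falls back to the current tree (here a stand-in function returning 0) for pairs not yet seen. State names and action lists are hypothetical placeholders for the states of Figure 1.

```python
def generate_examples(episode, q_hat, gamma=0.9):
    """Sweep an episode backwards and return (state, action, q) examples.

    episode: list of (state, action, reward, next_state, next_actions).
    q_hat:   function (state, action) -> current Q estimate.
    """
    recent = {}  # most recent q-value per (state, action) pair

    def lookup(state, action):
        return recent.get((state, action), q_hat(state, action))

    examples = []
    for state, action, reward, next_state, next_actions in reversed(episode):
        best_next = max((lookup(next_state, b) for b in next_actions),
                        default=0.0)
        q = reward + gamma * best_next
        recent[(state, action)] = q
        examples.append((state, action, q))
    return examples


def q0(state, action):
    """Stand-in for the initial tree, which predicts 0 everywhere."""
    return 0.0


# The three steps of the episode in Figure 1; s4 is the goal state.
episode = [
    ("s1", "move(c,floor)", 0.0, "s2", ["move(b,c)", "move(c,b)"]),
    ("s2", "move(b,c)", 0.0, "s3", ["move(a,b)", "move(b,floor)"]),
    ("s3", "move(a,b)", 1.0, "s4", ["move(a,floor)"]),
]
examples = generate_examples(episode, q0)
```

The examples come out in reverse order of execution, with q-values 1.0, 0.9 and 0.81, and would then be handed to the regression tree learner.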
Table 1: Examples for TILDE-RT generated from the blocks-world Q-learning episode in Figure 1.

qvalue(0.81).            qvalue(0.9).          qvalue(1.0).          qvalue(0.0).
action(move(c,floor)).   action(move(b,c)).    action(move(a,b)).    action(move(a,floor)).
goal(on(a,b)).           goal(on(a,b)).        goal(on(a,b)).        goal(on(a,b)).
clear(c).                clear(b).             clear(a).             clear(a).
on(c,b).                 clear(c).             clear(b).             on(a,b).
on(b,a).                 on(b,a).              on(b,c).              on(b,c).
on(a,floor).             on(a,floor).          on(a,floor).          on(c,floor).
                         on(c,floor).          on(c,floor).

4.3 TOP-DOWN INDUCTION OF LOGICAL REGRESSION TREES

Logical  regression  trees  are  similar  to  propositional  re¬ 
gression  trees  [3]:  leaves  predict  a  value  for  a  continu¬ 
ous  class,  while  internal  nodes  contain  conditions  that 
partition  the  example  space.  The  difference  is  that 
examples  here  are  not  feature  or  attribute-value  vec¬ 
tors,  but  sets  of  relational  facts,  representing,  e.g.,  a 
state  of  the  blocks  world,  a  goal,  and  an  action  to  be 
taken,  all  at  the  same  time.  Similarly,  internal  nodes 
are  not  restricted  to  attribute-value  tests  but  can  be 
first  order  literals  containing  predicates,  variables  and 
complex  terms. 


The TILDE-RT system [6] induces such first order
logical regression trees (or relational regression trees)
from examples (cf. [9] for a related approach). The input
for TILDE-RT is a set of state-action pairs together
with the corresponding Q-values, represented as sets of
facts. From this TILDE-RT induces (using the classical
TDIDT algorithm) a tree in which the classes
correspond to real numbers (Q-values).


To  illustrate  the  above  notions,  consider  the  episode 
shown  in  Figure  1.  The  examples  for  TILDE-RT  gen¬ 
erated  by  the  RRL  algorithm  are  given  in  Table  1.  The 
relational  regression  tree  induced  by  TILDE-RT  from 
these  examples  is  shown  in  Figure  2. 


Nodes in the tree correspond to Prolog queries. If
the query succeeds in an example the yes subtree is
taken, otherwise the no subtree. Different nodes in
the tree may share variables, e.g., the bottom node
in the tree (containing action(move(D,B))) refers to
the variable D that first appears in the root of the tree
(goal(on(C,D))). The Prolog program corresponding
to the tree is shown in the lower part of Figure 2.


The  semantics  of  logical  decision  trees  is  extensively 
discussed  in  [1],  as  well  as  the  correspondence  between 
a  tree  and  a  Prolog  program.  The  method  to  induce 
the  trees  is  described  in  [6]  and  is  -  for  the  case  of 
regression  trees  -  very  similar  to  Kramer’s  SRT  system 
[9].  We  refer  to  these  papers  for  more  details  on  the 
representation  and  learning  of  such  trees. 

To find the Q-value corresponding to a state-action
pair, one has to construct a Prolog knowledge base
containing the Prolog program (corresponding to the
tree), all facts in the state, the action, and the goal.
Running the query ?- qvalue(Q) will then return the
desired result. E.g., the Q-tree above will return a
Q-value of zero for all actions if the goal is on(C,D) and
on(C,D) holds in the state (goal states are absorbing).
On the other hand, if the goal on(C,D) does not yet
hold and the action is move(C,D), then a Q-value of
one is returned (reward of one for achieving a goal
state).


action(move(A,B)), goal(on(C,D))
on(C,D) ?
+--yes: [0]
+--no:  action(move(C,D)) ?
        +--yes: [1]
        +--no:  action(move(D,B)) ?
                +--yes: [0.9]
                +--no:  [0.81]

qvalue(0) :-
    action(move(A,B)), goal(on(C,D)),
    on(C,D), !.
qvalue(1) :-
    action(move(A,B)), goal(on(C,D)),
    action(move(C,D)), !.
qvalue(0.9) :-
    action(move(A,B)), goal(on(C,D)),
    action(move(D,B)), !.
qvalue(0.81).

Figure  2:  A  relational  regression  tree  generated  by 
TILDE-RT  from  the  examples  in  Table  1  and  its  equiv¬ 
alent  Prolog  program. 
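
As a rough illustration of how the program in Figure 2 is evaluated, the following Python sketch mirrors its clauses as a first-match rule list. The function name `q_value` and the encoding of a state as a set of (X, Y) pairs standing for on(X,Y) facts are assumptions made for this sketch; they are not part of TILDE-RT.

```python
def q_value(state, action, goal):
    """Mirror the four clauses of the Figure 2 program, tried in order.

    state  -- set of (x, y) pairs, each meaning on(x, y)   (assumed encoding)
    action -- (a, b), meaning move(a, b)
    goal   -- (c, d), meaning goal(on(c, d))
    """
    a, b = action
    c, d = goal
    if (c, d) in state:        # on(C,D) already holds: absorbing goal state
        return 0.0
    if (a, b) == (c, d):       # action(move(C,D)): the move achieves the goal
        return 1.0
    if a == d:                 # action(move(D,B)): the moved block is D
        return 0.9
    return 0.81                # default clause qvalue(0.81)


# Example: goal on(a,b), block c currently sits on a.
state = {("c", "a"), ("a", "floor"), ("b", "floor")}
print(q_value(state, ("b", "c"), ("a", "b")))
```

The early returns play the role of the cuts in the Prolog program: once a clause body succeeds, no later clause is tried.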
action(move(A,B)), goal(on(C,D))
on(C,D) ?
+--yes: [0]
+--no:  action(move(C,D)) ?
   +--yes: [1]
   +--no:  on(B,C) ?
      +--yes: [0.729]
      +--no:  on(B,D) ?
         +--yes: [0.729]
         +--no:  action(move(A,C)) ?
            +--yes: [0.81]
            +--no:  action(move(A,D)) ?
               +--yes: [0.81]
               +--no:  clear(D) ?
                  +--yes: on(C,B) ?
                  |  +--yes: on(A,C) ?
                  |  |  +--yes: [0.9]
                  |  |  +--no:  clear(C) ?
                  |  |     +--yes: [0.9]
                  |  |     +--no:  [0.81]
                  |  +--no:  [0.9]
                  +--no:  clear(C) ?
                     +--yes: on(C,B) ?
                     |  +--yes: [0.9]
                     |  +--no:  [0.81]
                     +--no:  [0.81]

Figure  3:  The  Q-tree  generated  by  RRL  in  the  3  blocks  world  after  10  episodes. 


5  EXPERIMENTS 

We  applied  the  RRL  algorithm  described  above  to 
learn  how  to  stack  one  block  onto  another  in  worlds 
with  three  and  four  blocks,  respectively.  In  particular, 
the goal to achieve was on(a, b), the two other blocks
being  c  and  d.  An  example  episode  in  the  three  blocks 
world  is  depicted  in  Figure  1. 


The discount factor γ had the value 0.9. When selecting
states stochastically according to equation 1, the
constant  k  was  set  to  e°'^.  Examples  for  learning  Q- 
trees  were  generated  after  each  episode,  as  described 
in  the  section  above. 
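
With γ = 0.9 and a reward of one only for the action that reaches the goal, the leaf values 1, 0.9, 0.81 and 0.729 that appear in Figures 2 and 3 are simply powers of the discount factor. The sketch below shows this for a single deterministic episode. It assumes that all Q-values start at zero, so that the best successor value is the one of the action taken next; the function name `backup_q_values` is hypothetical.

```python
def backup_q_values(rewards, gamma=0.9):
    """Return one Q-value per step of an episode, computed backwards.

    rewards -- reward received after each action, in episode order.
    Assumes a deterministic world and zero-initialised Q-values, so the
    update Q(s, a) = r + gamma * max Q(s', a') reduces to a discounted
    backward pass along the episode.
    """
    q_values = []
    next_q = 0.0                      # the goal state is absorbing: Q = 0
    for reward in reversed(rewards):
        q = reward + gamma * next_q
        q_values.append(q)
        next_q = q
    q_values.reverse()
    return q_values


# A four-step episode whose last action reaches the goal.
print([round(q, 3) for q in backup_q_values([0, 0, 0, 1])])
```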


TILDE-RT  was  used  to  induce  an  updated  Q-tree  after 
each  episode.  The  minimal  number  of  cases  in  a  leaf 
was  set  to  one  and  TILDE-RT  generated  unpruned 
trees,  which  exactly  reproduce  the  Q- values  for  the 
state-action  pairs  seen  during  the  learning  phase. 


Using  the  above  settings,  the  RRL  algorithm  was  first 
run  for  10  episodes  in  the  3  blocks  world.  The  tree 
shown  in  Figure  3  was  generated  by  TILDE-RT  after 
the  final  episode.  This  tree  represents  the  optimal  pol¬ 
icy  for  the  given  reinforcement  learning  problem.  The 
top two levels of the tree match those of the tree in
Figure 2, which was generated from a single episode.

It is important to note that the individual blocks are
not referred to in the tree itself directly, but only
through the variables of the goal. This means that the
tree represents the optimal policy not only for achieving
the goal on(a,b), but also on(b,c) and on(c,a).
This is one of the major advantages of using a relational
representation for Q-learning.

The Q-tree obtained after 10 episodes in the 4 blocks
world was much larger (44 nodes as opposed to the 12
nodes  of  the  3-blocks  Q-tree).  It  also  represents  an  op¬ 
timal  policy:  it  chooses  a  shortest  path  to  a  goal  state 
from  all  initial  states,  if  the  action  with  the  highest 
Q-value  is  always  selected. 
The 3 top levels of the tree match with the tree from
the  3  blocks  world.  This  indicates  that  the  result  of 
learning  in  the  3  blocks  world  could  be  used  to  boot¬ 
strap  learning  in  the  4  blocks  world.  Indeed,  if  we  take 
the  Q-tree  learned  in  the  3  blocks  world  shown  in  Fig¬ 
ure  3  and  use  it  to  select  actions  in  the  4  blocks  world, 
it  selects  an  optimal  path  to  a  goal  state  from  all  but 
9  of  the  73  possible  initial  states.  In  4  of  the  9  cases  a 
looping  behavior  is  produced,  in  the  remaining  5  cases 
one  extra  action  is  needed  as  compared  to  an  optimal 
plan. 

Using  the  Q-tree  from  Figure  3  to  bootstrap  RRL  in 
the  4  blocks  world  helps  improve  performance,  espe¬ 
cially  in  the  initial  episodes.  Without  bootstrapping, 
after  two  episodes  a  tree  is  learned  which  produces 
nonoptimal  behavior  in  12  of  the  73  initial  states. 
With  bootstrapping,  the  behavior  of  the  learned  tree  is 
nonoptimal  for  8  of  the  73  possible  initial  states.  After 
ten  episodes,  the  learned  Q-tree  produces  optimal  be¬ 
havior  and  is  much  smaller  (27  nodes)  as  compared  to 
the  Q-tree  learned  without  bootstrapping  (44  nodes). 

6  DISCUSSION 

We  have  presented  an  approach  to  planning  with 
incomplete  knowledge  that  combines  reinforcement 
learning  and  relational  regression  into  a  technique 
called  relational  reinforcement  learning.  The  advan¬ 
tages  of  this  approach  include  the  ability  to  use  struc¬ 
tured  representations,  which  enables  us  to  also  de¬ 
scribe  infinite  worlds,  and  the  ability  to  use  variables, 
which  allows  us  to  abstract  away  from  specific  details 
of the situations (such as the goal). The ability
to use results of simpler tasks to bootstrap learning in
more complex tasks is also an advantage worth mentioning.
Finally, it is easy to incorporate nondeterministic
actions within the proposed approach.

Even  for  standard  reinforcement  learning,  scaling-up 
as  the  dimensionality  of  the  problem  increases  can  be 
a  problem.  Using  a  richer  description  language  may 
seem  to  make  things  even  worse.  However,  there  are 
reasons  to  expect  that  using  a  richer  representation  ac¬ 
tually  enables  relational  Q-learning  to  scale-up  better 
than  standard  Q-learning.  Let  us  illustrate  these  on 
the  blocks  world. 

First, in the representation employed, the relational
theories learned abstract away the block names, causing
the number of states that are essentially different
to decrease. For instance, with goal(on(a,b))
the states {on(a,c), on(c,b), on(b,floor), on(d,floor)}
and {on(a,d), on(d,b), on(b,floor), on(c,floor)} are
essentially the same, as c and d are interchangeable.
In standard Q-learning, they would be considered different.
In our 4-blocks example, the number of states
that essentially differ from one another is 73 for a standard
Q-learner, but only 38 for a relational one. This
ratio increases combinatorially (since all blocks that
do not occur in the goal have no special status and are
thus interchangeable, the ratio increases roughly with
(n − 2)!, where n is the total number of blocks).
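
The two counts quoted above can be checked by brute force. The following sketch enumerates all states of a 4-blocks world and then merges states that differ only by renaming the blocks that do not occur in the goal. The encoding of a state as a set of stacks and the function names are assumptions of this sketch.

```python
from itertools import permutations


def states(blocks):
    """Yield every state as a frozenset of stacks (tuples, bottom block first)."""
    blocks = list(blocks)
    if not blocks:
        yield frozenset()
        return
    first, rest = blocks[0], blocks[1:]
    for partial in states(rest):
        # Either `first` forms a new stack of its own on the floor ...
        yield partial | {(first,)}
        # ... or it is inserted at any height of an existing stack.
        for stack in partial:
            for i in range(len(stack) + 1):
                yield (partial - {stack}) | {stack[:i] + (first,) + stack[i:]}


def canonical(state, free_blocks):
    """Smallest image of `state` under all renamings of the free blocks."""
    images = []
    for perm in permutations(free_blocks):
        rename = dict(zip(free_blocks, perm))
        images.append(tuple(sorted(
            tuple(rename.get(block, block) for block in stack)
            for stack in state)))
    return min(images)


all_states = list(states("abcd"))
# With goal on(a,b), blocks c and d are interchangeable.
relational_states = {canonical(s, ("c", "d")) for s in all_states}
print(len(all_states), len(relational_states))
```

Passing a longer block string to `states` and the corresponding free blocks to `canonical` gives the same comparison for larger worlds.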

Second,  the  use  of  background  knowledge  makes  it  pos¬ 
sible  to  abstract  even  further  from  specific  situations 
that  do  not  essentially  differ.  For  instance,  when  a 
has  to  be  cleared  in  order  to  be  able  to  move  it,  it  is 
not  essential  whether  there  are  1,  5  or  17  blocks  above 
a:  the  top  of  the  stack  on  a  should  be  moved.  Using 
background definitions such as above(X,Y) (the recursive
closure of on(X,Y)) it is possible to state a rule
such as “if there are blocks on a, move the topmost of
those blocks to the floor” which captures a very large
set of specific cases.
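
To make the role of such a background predicate concrete, here is a minimal Python sketch of above(X,Y) and of the rule quoted above. It assumes a state is given as a dictionary mapping each block to whatever it rests on; the helper names `above` and `topmost_above` are hypothetical.

```python
def above(on, x, y):
    """True if block x is somewhere above y (closure of the on relation).

    on -- dict mapping each block to its support: a block or "floor".
    """
    support = on.get(x)
    while support is not None and support != "floor":
        if support == y:
            return True
        support = on.get(support)
    return False


def topmost_above(on, y):
    """Return the clear block at the top of the stack on y, or None."""
    covered = set(on.values())            # blocks with something on them
    for x in on:
        if above(on, x, y) and x not in covered:
            return x
    return None


# d is on c, c is on a: whatever the height of the stack, the rule
# "move the topmost block above a to the floor" selects d.
on = {"a": "floor", "c": "a", "d": "c", "b": "floor"}
print(topmost_above(on, "a"))
```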

However,  the  exact  scale-up  behavior  of  relational  re¬ 
inforcement  learning  has  still  to  be  determined  ex¬ 
perimentally.  The  experimental  evaluation  of  our  ap¬ 
proach  done  so  far  is  preliminary  and  is  mainly  in¬ 
tended  to  highlight  the  principal  advantages  of  using 
a  relational  representation  for  reinforcement  learning. 
We  hope  that  this  paper  will  inspire  further  research 
into  the  combination  of  relational  and  reinforcement 
learning,  as  much  work  remains  to  be  done.  This 
includes  work  in  the  line  of  proper  performance  as¬ 
sessment,  both  in  terms  of  standard  performance  tests 
in  reinforcement  learning  fashion  (root  mean  square 
errors  of  learned  Q-values  wrt.  the  Q-values  of  the 
optimal  policy)  and  in  considering  more  complex  and 
demanding  planning  problems. 

More  complex  problems  can  be  obtained  by  increasing 
the  number  of  blocks  in  the  world,  considering  more 
complex  goals,  such  as  building  a  stack  of  all  available 
blocks,  and  considering  problems  outside  the  blocks 
world. 


This  work  is  related  to  work  on  generalization  in  re¬ 
inforcement  learning,  which  has  however  mainly  ad¬ 
dressed  the  use  of  neural  networks  for  this  purpose  [13]. 
The closest related work is probably Chapman and
Kaelbling’s decision tree algorithm, which was specifically
designed for reinforcement learning [5]. Note
however  that  our  approach  is  distinguished  from  the 
mainstream  work  in  reinforcement  learning  by  the  use 
of  a  relational  representation. 
Relational representations are commonly used in
ning  approaches.  There  have  also  been  some  at¬ 
tempts  to  combine  planning  with  relational  learning 
within  those  approaches,  e.g.,  within  the  PRODIGY 
approach  [2].  Our  approach  is  related  to  them  through 
the  use  of  a  relational  representation.  However,  it 
seems  that  the  combination  of  planning,  reinforcement 
learning  and  relational  learning  has  not  been  addressed 
so  far. 

The  reinforcement  learning  part  of  the  work  presented 
in  this  paper  is  admittedly  simple.  We  have  taken  a 
standard  textbook  description  of  reinforcement  learn¬ 
ing  [11]  and  incorporated  an  implementation  of  it 
within  our  approach.  We  have  considered  a  deter¬ 
ministic  setting  and  a  goal-oriented  formulation  of  the 
learning  problem.  However,  both  restrictions  can  be 
easily  lifted  to  extend  to  non-zero  rewards  on  non¬ 
terminal  states  (the  RRL  algorithm  actually  makes  no 
assumption  on  the  reinforcement  received)  and  non- 
deterministic  actions.  To  handle  nondeterministic  ac¬ 
tions  an  appropriate  update  rule  (see  page  382  of  [11]) 
has  to  be  used  to  generate  examples  for  the  TILDE- 
RT  algorithm.  Other  points  where  the  reinforcement 
learning  part  can  be  improved  include  the  initializa¬ 
tion  of  Q  values  and  the  exploration  strategy. 

The current implementation of TILDE-RT is, by
reinforcement learning standards, not optimal. One of
the reasons is that it is not incremental. However,
incrementality is not enough, as the (estimated) values
of Q are changing with time. These problems are taken
care of by Chapman and Kaelbling’s decision
tree algorithm, which was specifically designed for
reinforcement learning [5]. A natural direction for further
work  is  thus  to  develop  a  first  order  regression  tree  al¬ 
gorithm  combining  the  representations  of  TILDE-RT 
with  the  algorithm  and  performance  measures  of  the 
approach  by  Chapman  and  Kaelbling.  Such  an  in¬ 
tegrated  approach,  which  is  currently  under  develop¬ 
ment,  would  not  suffer  from  the  abovementioned  prob¬ 
lems. 
Abstract

The  two  dominant  schemes  for  rule-learning, 
C4.5  and  RIPPER,  both  operate  in  two 
stages.  First  they  induce  an  initial  rule  set 
and  then  they  refine  it  using  a  rather  com¬ 
plex  optimization  stage  that  discards  (C4.5) 
or  adjusts  (RIPPER)  individual  rules  to 
make  them  work  better  together.  In  con¬ 
trast,  this  paper  shows  how  good  rule  sets 
can  be  learned  one  rule  at  a  time,  with¬ 
out  any  need  for  global  optimization.  We 
present  an  algorithm  for  inferring  rules  by 
repeatedly  generating  partial  decision  trees, 
thus  combining  the  two  major  paradigms 
for  rule  generation — creating  rules  from  de¬ 
cision  trees  and  the  separate-and-conquer 
rule-learning  technique.  The  algorithm  is 
straightforward  and  elegant:  despite  this,  ex¬ 
periments  on  standard  datasets  show  that  it 
produces  rule  sets  that  are  as  accurate  as  and 
of  similar  size  to  those  generated  by  C4.5, 
and  more  accurate  than  RIPPER’s.  More¬ 
over,  it  operates  efficiently,  and  because  it 
avoids  postprocessing,  does  not  suffer  the  ex¬ 
tremely  slow  performance  on  pathological  ex¬ 
ample  sets  for  which  the  C4.5  method  has 
been  criticized. 


1  Introduction 

If-then  rules  are  the  basis  for  some  of  the  most  popular 
concept  description  languages  used  in  machine  learn¬ 
ing.  They  allow  “knowledge”  extracted  from  a  dataset 
to  be  represented  in  a  form  that  is  easy  for  people  to 
understand.  This  gives  domain  experts  the  chance  to 
analyze  and  validate  that  knowledge,  and  combine  it 
with previously known facts about the domain.

A  variety  of  approaches  to  learning  rules  have  been 
investigated.  One  is  to  begin  by  generating  a  deci¬ 
sion  tree,  then  to  transform  it  into  a  rule  set,  and 
finally  to  simplify  the  rules  (Quinlan,  1987a);  the  re¬ 
sulting  rule  set  is  often  more  accurate  than  the  original 
tree. Another is to use the “separate-and-conquer”
strategy (Pagallo & Haussler, 1990) first applied in
the AQ family of algorithms (Michalski, 1969) and
subsequently used as the basis of many rule learning
systems (Fürnkranz, 1996). In essence, this strategy
determines  the  most  powerful  rule  that  underlies  the 
dataset,  separates  out  those  examples  that  are  covered 
by  it,  and  repeats  the  procedure  on  the  remaining  ex¬ 
amples. 

Two dominant practical implementations of rule
learners have emerged from these strands of research:
C4.5  (Quinlan,  1993)  and  RIPPER  (Cohen,  1995). 
Both  perform  a  global  optimization  process  on  the  set 
of  rules  that  is  induced  initially.  The  motivation  for 
this  in  C4.5  is  that  the  initial  rule  set,  being  gener¬ 
ated  from  a  decision  tree,  is  unduly  large  and  redun¬ 
dant:  C4.5  drops  some  individual  rules  (having  pre¬ 
viously  optimized  rules  locally  by  dropping  conditions 
from  them).  The  motivation  in  RIPPER,  on  the  other 
hand,  is  to  increase  the  accuracy  of  the  rule  set  by  re¬ 
placing  or  revising  individual  rules.  In  either  case  the 
two-stage  nature  of  the  algorithm  remains:  as  Cohen 
(1995) puts it, “... both RIPPERk and C4.5rules start
with  an  initial  model  and  iteratively  improve  it  using 
heuristic  techniques.”  Experiments  show  that  both 
the  size  and  the  performance  of  rule  sets  are  signifi¬ 
cantly  improved  by  post-induction  optimization.  On 
the  other  hand,  the  process  itself  is  rather  complex  and 
heuristic. 

This  paper  presents  a  rule-induction  procedure  that 
avoids  global  optimization  but  nevertheless  produces 
accurate, compact rule sets. The method combines
the  two  rule  learning  paradigms  identified  above.  Sec¬ 
tion  2  discusses  these  two  paradigms  and  their  incar¬ 
nation  in  C4.5  and  RIPPER.  Section  3  presents  the 
new  algorithm,  which  we  call  “PART”  because  it  is 
based  on  partial  decision  trees.  Section  4  describes  an 
experimental  evaluation  on  standard  datasets  compar¬ 
ing  PART  to  C4.5,  RIPPER,  and  C5.0,  the  commercial 
successor of C4.5.¹ Section 5 summarizes our findings.

2  Related  Work 

We  review  two  basic  strategies  for  producing  rule  sets. 
The  first  is  to  begin  by  creating  a  decision  tree  and 
then  transform  it  into  a  rule  set  by  generating  one 
rule  for  each  path  from  the  root  to  a  leaf.  Most  rule 
sets  derived  in  this  way  can  be  simplified  dramatically 
without  losing  predictive  accuracy.  They  are  unnec¬ 
essarily  complex  because  the  disjunctions  that  they 
imply  can  often  not  be  expressed  succinctly  in  a  deci¬ 
sion  tree.  This  is  sometimes  known  as  the  “replicated 
subtree”  problem  (Pagallo  &  Haussler,  1990). 

When  obtaining  a  rule  set,  C4.5  first  transforms  an 
unpruned  decision  tree  into  a  set  of  rules  in  the  afore¬ 
mentioned  way.  Then  each  rule  is  simplified  separately 
by  greedily  deleting  conditions  in  order  to  minimize  the 
rule’s  estimated  error  rate.  Following  that,  the  rules 
for  each  class  in  turn  are  considered  and  a  “good” 
subset  is  sought,  guided  by  a  criterion  based  on  the 
minimum  description  length  principle  (Rissanen,  1978) 
(this  is  performed  greedily,  replacing  an  earlier  method 
that  used  simulated  annealing).  The  next  step  ranks 
the  subsets  for  the  different  classes  with  respect  to  each 
other  to  avoid  conflicts,  and  determines  a  default  class. 
Finally,  rules  are  greedily  deleted  from  the  whole  rule 
set  one  by  one,  so  long  as  this  decreases  the  rule  set’s 
error  on  the  training  data. 

The  whole  process  is  complex  and  time-consuming. 
Five separate stages are required to produce the final
rule  set.  It  has  been  shown  that  for  noisy  datasets, 
runtime  is  cubic  in  the  number  of  instances  (Cohen, 
1995).  Moreover,  despite  the  lengthy  optimization 
process,  rules  are  still  restricted  to  conjunctions  of 
those  attribute-value  tests  that  occur  along  a  path  in 
the  initial  decision  tree. 

Separate-and-conquer algorithms represent a more direct
approach to learning decision rules. They generate
one rule at a time, remove the instances covered
by that rule, and iteratively induce further rules
for the remaining instances. In a multi-class setting,
this automatically leads to an ordered list of rules,
a type of classifier that has been termed a “decision
list” (Rivest, 1987). Various different pruning methods
for separate-and-conquer algorithms have been investigated
by Fürnkranz (1997), who shows that the most
effective scheme is to prune each rule back immediately
after it is generated, using a separate stopping criterion
to determine when to cease adding rules (Fürnkranz
& Widmer, 1994). Although originally formulated for
two-class problems, this procedure can be applied directly
to multi-class settings by building rules separately
for each class and ordering them appropriately
(Cohen, 1995).

¹ A test version of C5.0 is available from
http://www.rulequest.com.

RIPPER  implements  this  strategy  using  reduced  error 
pruning  (Quinlan,  1987b),  which  sets  some  training 
data  aside  to  determine  when  to  drop  the  tail  of  a 
rule,  and  incorporates  a  heuristic  based  on  the  mini¬ 
mum  description  length  principle  as  stopping  criterion. 
It  follows  rule  induction  with  a  post-processing  step 
that  revises  the  rule  set  to  more  closely  approximate 
what  would  have  been  obtained  by  a  more  expensive 
global  pruning  strategy.  To  do  this,  it  considers  “re¬ 
placing”  or  “revising”  individual  rules,  guided  by  the 
error  of  the  modified  rule  set  on  the  pruning  data.  It 
then  decides  whether  to  leave  the  original  rule  alone  or 
substitute  its  replacement  or  revision,  a  decision  that 
is  made  according  to  the  minimum  description  length 
heuristic. It has been claimed (Cohen, 1995) that RIPPER
generates rule sets that are as accurate as C4.5’s.
However, our experiments on a large collection of standard
datasets, reported in Section 4, do not confirm
this.

As  the  following  example  shows,  the  basic  strategy  of 
building  a  single  rule  and  pruning  it  back  can  lead  to 
a  particularly  problematic  form  of  overpruning,  which 
we  call  “hasty  generalization.”  This  is  because  the 
pruning  interacts  with  the  covering  heuristic.  General¬ 
izations  are  made  before  their  implications  are  known, 
and  the  covering  heuristic  then  prevents  the  learning 
algorithm  from  discovering  the  implications. 

Here is a simple example of hasty generalization. Consider
a Boolean dataset with attributes a and b built
from the three rules in Figure 1, corrupted by ten percent
class noise. Assume that the pruning operator is
conservative and can only delete a single final conjunction
of a rule at a time (not an entire tail of conjunctions
as RIPPER does). Assume further that the first
rule has been generated and pruned back to

a = true ⇒ ⊕

(The training data in Figure 1 is solely to make this
scenario plausible.) Now consider whether the rule
should be further pruned. Its error rate on the pruning
set is 5/35, and the null rule

⇒ ⊕

has an error rate of 14/110, which is smaller. Thus
the rule set will be pruned back to this single, trivial
rule, instead of the patently more accurate three-rule
set shown in Figure 1.

                                        Coverage
                                Training Set    Pruning Set
Rule                              ⊕      ⊖       ⊕      ⊖
1: a = true ⇒ ⊕                  90      8      30      5
2: a = false ∧ b = true ⇒ ⊕     200     18      66      6
3: a = false ∧ b = false ⇒ ⊖      1     10       0      3

Figure 1: A hypothetical target concept for a noisy domain.
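
The comparison that drives this pruning decision can be reproduced directly from the pruning-set columns of Figure 1. The sketch below is only a worked check of the arithmetic; the dictionary layout and function names are assumptions, and "+" and "-" stand for the two class symbols of the figure.

```python
from fractions import Fraction

# Pruning-set coverage from Figure 1:
# rule number -> (positive instances, negative instances, predicted class)
PRUNING = {1: (30, 5, "+"), 2: (66, 6, "+"), 3: (0, 3, "-")}


def errors(pos, neg, predicted):
    """Number of covered instances that the prediction gets wrong."""
    return neg if predicted == "+" else pos


def rule_error_rate(rule):
    pos, neg, predicted = PRUNING[rule]
    return Fraction(errors(pos, neg, predicted), pos + neg)


def null_rule_error_rate(predicted="+"):
    """Error of a rule without conditions, which covers every instance."""
    total = sum(pos + neg for pos, neg, _ in PRUNING.values())
    wrong = sum(errors(pos, neg, predicted) for pos, neg, _ in PRUNING.values())
    return Fraction(wrong, total)


def three_rule_error_rate():
    """Error of the full three-rule set on the pruning data."""
    total = sum(pos + neg for pos, neg, _ in PRUNING.values())
    wrong = sum(errors(*entry) for entry in PRUNING.values())
    return Fraction(wrong, total)


print(float(rule_error_rate(1)))        # 5/35
print(float(null_rule_error_rate()))    # 14/110
print(float(three_rule_error_rate()))   # 11/110
```

The null rule looks better than rule 1 in isolation, although the complete three-rule set has a lower error on the same data than either.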

Hasty  generalization  is  not  just  an  artifact  of  reduced 
error  pruning:  it  can  happen  with  pessimistic  prun¬ 
ing  (Quinlan,  1993)  too.  Because  of  variation  in  the 
number  of  noisy  instances  in  the  data  sample,  one  can 
always  construct  situations  in  which  pruning  causes 
rules  with  comparatively  large  coverage  to  swallow 
rules  with  smaller  (but  still  significant)  coverage.  This 
can  happen  whenever  the  number  of  errors  committed 
by  a  rule  is  large  compared  with  the  total  number  of 
instances  covered  by  an  adjacent  rule. 

3  Obtaining  Rules  From  Partial 
Decision  Trees 

The  new  method  for  rule  induction,  PART,  combines 
the  two  approaches  discussed  in  Section  2  in  an  at¬ 
tempt  to  avoid  their  respective  problems.  Unlike  both 
C4.5  and  RIPPER  it  does  not  need  to  perform  global 
optimization  to  produce  accurate  rule  sets,  and  this 
added  simplicity  is  its  main  advantage.  It  adopts  the 
separate-and-conquer  strategy  in  that  it  builds  a  rule, 
removes  the  instances  it  covers,  and  continues  creat¬ 
ing  rules  recursively  for  the  remaining  instances  until 
none  are  left.  It  differs  from  the  standard  approach 


in  the  way  that  each  rule  is  created.  In  essence,  to 
make  a  single  rule  a  pruned  decision  tree  is  built  for 
the  current  set  of  instances,  the  leaf  with  the  largest 
coverage  is  made  into  a  rule,  and  the  tree  is  discarded. 
This  avoids  hasty  generalization  by  only  generalizing 
once  the  implications  are  known  (i.e.,  all  the  subtrees 
have  been  expanded). 
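
The outer separate-and-conquer loop described above can be sketched as follows. The rule learner is passed in as a function; in PART it would build a pruned (partial) decision tree and return the leaf with the largest coverage, whereas the `toy_rule_learner` used here is a deliberately trivial, hypothetical stand-in so that the example runs on its own.

```python
from collections import Counter


def separate_and_conquer(instances, learn_one_rule):
    """Build an ordered rule list: learn a rule, remove what it covers, repeat.

    instances      -- list of (features_dict, label) pairs
    learn_one_rule -- function returning (condition, label) for the instances
    """
    rules, remaining = [], list(instances)
    while remaining:
        condition, label = learn_one_rule(remaining)
        if not any(condition(features) for features, _ in remaining):
            raise ValueError("rule covers none of the remaining instances")
        rules.append((condition, label))
        remaining = [(f, l) for f, l in remaining if not condition(f)]
    return rules


def toy_rule_learner(instances):
    """Stand-in learner: test the most common value of attribute 'a'."""
    value = Counter(f["a"] for f, _ in instances).most_common(1)[0][0]
    labels = Counter(l for f, l in instances if f["a"] == value)
    return (lambda f, v=value: f["a"] == v), labels.most_common(1)[0][0]


def classify(rules, features):
    """Apply the decision list: the first rule that fires decides."""
    for condition, label in rules:
        if condition(features):
            return label
    return None


data = [({"a": True}, "+")] * 3 + [({"a": False}, "-")] * 2
rules = separate_and_conquer(data, toy_rule_learner)
print(len(rules), classify(rules, {"a": False}))
```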

The  prospect  of  repeatedly  building  decision  trees  only 
to  discard  most  of  them  is  not  as  bizarre  as  it  first 
seems.  Using  a  pruned  tree  to  obtain  a  rule  instead  of 
building  it  incrementally  by  adding  conjunctions  one 
at  a  time  avoids  the  over-pruning  problem  of  the  basic 
separate-and-conquer  rule  learner.  Using  the  separate- 
and-conquer  methodology  in  conjunction  with  decision 
trees  adds  flexibility  and  speed.  It  is  indeed  wasteful  to 
build  a  full  decision  tree  just  to  obtain  a  single  rule,  but 
the  process  can  be  accelerated  significantly  without 
sacrificing  the  above  advantages. 

The  key  idea  is  to  build  a  “partial”  decision  tree  in¬ 
stead  of  a  fully  explored  one.  A  partial  decision  tree 
is  an  ordinary  decision  tree  that  contains  branches  to 
undefined  subtrees.  To  generate  such  a  tree,  we  inte¬ 
grate  the  construction  and  pruning  operations  in  order 
to  find  a  “stable”  subtree  that  can  be  simplified  no  fur¬ 
ther.  Once  this  subtree  has  been  found,  tree-building 
ceases  and  a  single  rule  is  read  off. 

The  tree-building  algorithm  is  summarized  in  Figure  2: 
it  splits  a  set  of  examples  recursively  into  a  partial  tree. 
The  first  step  chooses  a  test  and  divides  the  examples 
into  subsets  accordingly.  Our  implementation  makes 
this  choice  in  exactly  the  same  way  as  C4.5.  Then 
the  subsets  are  expanded  in  order  of  their  average  en¬ 
tropy,  starting  with  the  smallest.  (The  reason  for  this 
is  that  subsequent  subsets  will  most  likely  not  end  up 
being  expanded,  and  the  subset  with  low  average  en¬ 
tropy  is  more  likely  to  result  in  a  small  subtree  and 
therefore  produce  a  more  general  rule.)  This  continues 
recursively  until  a  subset  is  expanded  into  a  leaf,  and 
Procedure Expand Subset

    choose split of given set of examples into subsets
    while there are subsets that have not been expanded and
          all the subsets expanded so far are leaves
        choose next subset to be expanded and expand it
    if all the subsets expanded are leaves and
       estimated error for subtree > estimated error for node
        undo expansion into subsets and make node a leaf

Figure  2:  Method  that  expands  a  given  set  of  examples  into  a  partial  tree 
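
The procedure in Figure 2 might be sketched in Python as below. Several simplifications are assumptions of this sketch rather than features of the implementation described in the text: attributes are nominal only, the split is chosen by lowest weighted entropy instead of C4.5's criterion, and `leaf_error` is a crude pessimistic estimate (training errors plus 0.5) standing in for C4.5's error estimate. All names are hypothetical.

```python
import math
from collections import Counter


class Node:
    """Node of a partial tree; `expanded` stays False for undefined subtrees."""
    def __init__(self, rows):
        self.rows = rows          # list of (features_dict, label)
        self.attr = None          # attribute tested at an internal node
        self.children = {}        # attribute value -> Node
        self.is_leaf = False
        self.expanded = False


def entropy(rows):
    n = len(rows)
    counts = Counter(label for _, label in rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def leaf_error(rows):
    """Stand-in pessimistic estimate: misclassified rows plus 0.5."""
    counts = Counter(label for _, label in rows)
    return len(rows) - max(counts.values()) + 0.5


def partition(rows, attr):
    parts = {}
    for row in rows:
        parts.setdefault(row[0][attr], []).append(row)
    return parts


def expand(node, attrs):
    node.expanded = True
    if entropy(node.rows) == 0 or not attrs:
        node.is_leaf = True
        return
    # Choose a split (simplified criterion: lowest weighted entropy).
    def weighted(attr):
        return sum(len(p) * entropy(p) for p in partition(node.rows, attr).values())
    best = min(attrs, key=weighted)
    parts = partition(node.rows, best)
    if len(parts) < 2:
        node.is_leaf = True
        return
    node.attr = best
    node.children = {value: Node(rows) for value, rows in parts.items()}
    rest = [a for a in attrs if a != best]
    # Expand subsets in order of entropy, lowest first, and stop as soon
    # as an expanded subset is not a leaf; the rest stay undefined.
    for child in sorted(node.children.values(), key=lambda c: entropy(c.rows)):
        expand(child, rest)
        if not child.is_leaf:
            break
    children = list(node.children.values())
    if all(c.expanded and c.is_leaf for c in children):
        subtree_error = sum(leaf_error(c.rows) for c in children)
        if subtree_error > leaf_error(node.rows):
            # Undo the expansion: subtree replacement by a single leaf.
            node.attr, node.children, node.is_leaf = None, {}, True


def make_rows():
    """Toy data: x=1 is always '+', x=2 follows y, x=3 follows z."""
    rows = []
    for x in (1, 2, 3):
        for y in (True, False):
            for z in (True, False):
                positive = x == 1 or (x == 2 and y) or (x == 3 and z)
                rows.append(({"x": x, "y": y, "z": z}, "+" if positive else "-"))
    return rows


root = Node(make_rows())
expand(root, ["x", "y", "z"])
print(root.attr, {v: (c.expanded, c.is_leaf) for v, c in root.children.items()})
```

On this toy data the subset x=1 becomes a leaf, x=2 is expanded into a subtree that is not replaced, and x=3 is never explored, which is the partial-tree behaviour the figure describes.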


Figure  3:  Example  of  how  our  algorithm  builds  a  partial  tree 


then  continues  further  by  backtracking.  But  as  soon 
as  an  internal  node  appears  which  has  all  its  children 
expanded  into  leaves,  pruning  begins:  the  algorithm 
checks  whether  that  node  is  better  replaced  by  a  single 
leaf.  This  is  just  the  standard  “subtree  replacement” 
operation  of  decision-tree  pruning,  and  our  implemen¬ 
tation  makes  the  decision  in  exactly  the  same  way  as 
C4.5.  (C4.5’s  other  pruning  operation,  “subtree  rais¬ 
ing,”  plays  no  part  in  our  algorithm.)  If  replacement 
is  performed  the  algorithm  backtracks  in  the  standard 
way,  exploring  siblings  of  the  newly-replaced  node. 
However, if during backtracking a node is encountered
whose children are not all leaves (and this will happen
as soon as a potential subtree replacement is not
performed), then the remaining subsets are left unexplored
and the corresponding subtrees are left undefined.
Due to the recursive structure of the algorithm,
this event automatically terminates tree generation.


Figure  3  shows  a  step-by-step  example.  During  stages 
1-3,  tree-building  continues  recursively  in  the  normal 
way — except  that  at  each  point  the  lowest-entropy  sib¬ 
ling  is  chosen  for  expansion:  node  3  between  stages  1 
and  2.  Gray  nodes  are  as  yet  unexpanded;  black  ones 
are  leaves.  Between  Stages  2  and  3,  the  black  node  will 
have  lower  entropy  than  its  sibling,  node  5;  but  cannot 
be  expanded  further  since  it  is  a  leaf.  Backtracking  oc¬ 
curs  and  node  5  is  chosen  for  expansion.  Once  stage 
3  is  reached,  there  is  a  node — node  5 — which  has  all 
of  its  children  expanded  into  leaves,  and  this  triggers 
pruning.  Subtree  replacement  for  node  5  is  consid¬ 
ered,  and  accepted,  leading  to  stage  4.  Now  node  3  is 
considered  for  subtree  replacement,  and  this  operation 
is  again  accepted.  Backtracking  continues,  and  node 
4,  having  lower  entropy  than  2,  is  expanded — into  two 
leaves.  Now  subtree  replacement  is  considered  for  node 
4:  let  us  suppose  that  node  4  is  not  replaced.  At  this 
point, the process effectively terminates with the 3-leaf
partial tree of stage 5.

Figure 4: CPU times for PART on artificial dataset
(ab+bcd+defg) with 12 irrelevant binary attributes and
uniformly distributed examples

This  procedure  ensures  that  the  over-pruning  effect 
discussed  in  Section  2  cannot  occur.  A  node  can  only 
be  pruned  if  all  its  successors  are  leaves.  This  can 
only  happen  if  all  its  subtrees  have  been  explored  and 
either  found  to  be  leaves,  or  are  pruned  back  to  leaves. 
Situations  like  that  shown  in  Figure  1  are  therefore 
handled  correctly. 

If  a  dataset  is  noise-free  and  contains  enough  instances 
to  prevent  the  algorithm  from  doing  any  pruning, 
just  one  path  of  the  full  decision  tree  has  to  be  ex¬ 
plored.  This  achieves  the  greatest  possible  perfor¬ 
mance  gain  over  the  naive  method  that  builds  a  full 
decision  tree  each  time.  The  gain  decreases  as  more 
pruning takes place. For datasets with numeric attributes,
the asymptotic time complexity of the algorithm
is the same as for building the full decision tree,²
because in this case the complexity is dominated by
the time needed to sort the attribute values in the
first place.

Once a partial tree has been built, a single rule is
extracted from it. Each leaf corresponds to a possible
rule, and we seek the “best” leaf of those subtrees
(typically a small minority) that have been expanded into
leaves. Our implementation aims at the most general
rule by choosing the leaf that covers the greatest number
of instances. (We have experimented with choosing
the most accurate rule, that is, the leaf with the lowest
error rate, error being estimated according to C4.5’s
Bernoulli heuristic, but this does not improve the rule
set’s accuracy.)

² Assuming no subtree raising.

Datasets often contain missing attribute values, and practical
learning schemes must deal with them efficiently. When constructing
a partial tree we treat missing values in exactly the same way as
C4.5: if an instance cannot be assigned deterministically to a
branch because of a missing attribute value, it is assigned to each
of the branches with a weight proportional to the number of
training instances going down that branch, normalized by the total
number of training instances with known values at the node. During
testing we apply the same procedure separately to each rule, thus
associating a weight with the application of each rule to the test
instance. That weight is deducted from the instance’s total weight
before it is passed to the next rule in the list. Once the weight
has reduced to zero, the predicted class probabilities are combined
into a final classification according to the weights.
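The training-time treatment of a missing value described above can be sketched as follows. This is only an illustration of the weighting rule, not the authors' implementation; the function name `distribute_missing` and the example branch counts are hypothetical.

```python
def distribute_missing(instance_weight, branch_counts):
    """Split an instance's weight across branches in proportion to the
    number of training instances with known values in each branch."""
    total_known = sum(branch_counts)
    if total_known == 0:
        raise ValueError("no training instances with known values at this node")
    return [instance_weight * count / total_known for count in branch_counts]


# Hypothetical node: 30, 50 and 20 known-value instances go down three branches.
fractions = distribute_missing(1.0, [30, 50, 20])
print(fractions)  # [0.3, 0.5, 0.2]
```

The fractions always sum to the weight that entered the node, so no instance weight is created or lost by the split.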

The algorithm’s runtime depends on the number of rules it
generates. Because a decision tree can be built in time
O(an log n) for a dataset with n examples and a attributes, the
time taken to generate a rule set of size k is O(kan log n).
Assuming (as the analyses of (Cohen, 1995) and (Fürnkranz, 1997)
do) that the size of the final theory is constant, the overall time
complexity is O(an log n), as compared to O(an log² n) for RIPPER.
In practice, the number of rules grows with the size of the
training data because of the greedy rule learning strategy and
pessimistic pruning. However, even in the worst case when the
number of rules increases linearly with training examples, the
overall complexity is bounded by O(an² log n). In our experiments
we only ever observed subquadratic run times, even for the
artificial dataset that Cohen (1995) used to show that C4.5’s
performance can be cubic in the number of examples. The results of
timing our method, PART, on this dataset are depicted on a log-log
scale in Figure 4, for no class noise and for 20 percent class
noise. In the latter case C4.5 scales as the cube of the number of
examples.

Table 1: Datasets used for the experiments

Dataset         Instances     Missing     Numeric     Nominal  Classes
                           values (%)  attributes  attributes
anneal                898         0.0           6          32        5
audiology             226         2.0           0          69       24
australian            690         0.6           6           9        2
autos                 205         1.1          15          10        6
balance-scale         625         0.0           4           0        3
breast-cancer         286         0.3           0           9        2
breast-w              699         0.3           9           0        2
german               1000         0.0           7          13        2
glass (G2)            163         0.0           9           0        2
glass                 214         0.0           9           0        6
heart-c               303         0.2           6           7        2
heart-h               294        20.4           6           7        2
heart-statlog         270         0.0          13           0        2
hepatitis             155         5.6           6          13        2
horse-colic           368        23.8           7          15        2
hypothyroid          3772         5.5           7          22        4
ionosphere            351         0.0          34           0        2
iris                  150         0.0           4           0        3
kr-vs-kp             3196         0.0           0          36        2
labor                  57         3.9           8           8        2
lymphography          148         0.0           3          15        4
mushroom             8124         1.4           0          22        2
pima-indians          768         0.0           8           0        2
primary-tumor         339         3.9           0          17       21
segment              2310         0.0          19           0        7
sick                 3772         5.5           7          22        2
sonar                 208         0.0          60           0        2
soybean               683         9.8           0          25       19
splice               3190         0.0           0          61        3
vehicle               846         0.0          18           0        4
vote                  435         5.6           0          16        2
vowel                 990         0.0          10           3       11
waveform-noise       5000         0.0          40           0        3
zoo                   101         0.0           1          15        7


4  Experimental  Results 

In order to evaluate the performance of PART on a diverse set of
practical learning problems, we performed experiments on
thirty-four standard datasets from the UCI collection (Merz &
Murphy, 1996).³ The datasets and their characteristics are listed
in Table 1.

As well as the learning algorithm PART described above, we also ran
C4.5,⁴ C5.0 and RIPPER on all the datasets. The results are listed
in Table 2. They give the percentage of correct classifications,
averaged over ten ten-fold cross-validation runs, and standard
deviations of the ten are also shown. The same folds were used for
each scheme.⁵ Results for C4.5, C5.0 and RIPPER are marked with ○
if they show significant improvement over the corresponding results
for PART, and with • if they show significant degradation. (The †
marks are discussed below.) Throughout, we speak of results being
“significantly different” if the difference is statistically
significant at the 1% level according to a paired two-sided t-test,
each pair of data points consisting of the estimates obtained in
one ten-fold cross-validation run for the two learning schemes
being compared. Table 3 shows how the different methods compare
with each other. Each entry indicates the number of datasets for
which the method associated with its column is significantly more
accurate than the method associated with its row.

³Following Holte (1993), the G2 variant of the glass dataset has
classes 1 and 3 combined and classes 4 to 7 deleted, and the
horse-colic dataset has attributes 3, 25, 26, 27, 28 deleted with
attribute 24 being used as the class. We also deleted all
identifier attributes from the datasets.

⁴We used Revision 8 of C4.5.

⁵The results of PART and C5.0 on the hypothyroid data, and of PART
and C4.5 on the mushroom data, are not in fact the same: they
differ in the second decimal place.

Table 2: Experimental results: percentage of correct classifications, and standard deviation

Dataset              PART       C4.5          C5.0        RIPPER
anneal           98.4±0.3   98.6±0.2 †    98.7±0.3 ○    98.3±0.1
audiology        78.7±1.1   76.3±1.2 •†   77.3±1.2 †    72.3±2.2 •
australian       84.3±1.2   84.8±1.1 ○    85.4±0.7      85.3±0.7
autos            74.5±1.4   76.5±2.9 †    79.1±2.1      72.0±2.0 •
balance-scale    82.3±1.2   78.0±0.7 •    79.0±1.0 •    81.0±1.1
breast-cancer    69.6±1.6   70.3±1.6      73.6±1.6 ○    71.8±1.6 ○
breast-w         94.9±0.4   95.5±0.6 †    95.5±0.3 ○    95.6±0.7
horse-colic      84.4±0.8   83.0±0.6 •    85.0±0.5      85.0±0.8
german           70.0±1.4   71.9±1.4 ○    72.3±0.5 ○    71.4±0.7 ○
glass (G2)       80.0±3.6   79.4±2.3 †    80.2±1.8 †    80.9±1.4
glass            70.0±1.6   67.3±2.4      68.4±2.8 †    66.7±2.1 •
heart-c          78.5±1.7   79.7±1.5      79.1±0.9      78.5±1.9
heart-h          80.5±1.5   79.7±1.7      80.7±1.1      78.7±1.3 •
heart-statlog    78.9±1.3   81.2±1.3 ○    81.9±1.4 ○    79.0±1.4
hepatitis        80.2±1.9   79.7±1.0 †    81.1±0.7      77.2±2.0 •
hypothyroid      99.5±0.1   99.5±0.1 †    99.5±0.0 •†   99.4±0.1 •
ionosphere       90.6±1.3   89.9±1.5 †    89.3±1.4 •†   89.2±0.8 •
iris             93.7±1.6   95.1±1.0      94.4±0.7 †    94.4±1.7
kr-vs-kp         99.3±0.1   99.4±0.1      99.3±0.1      99.1±0.1 •
labor            77.3±3.9   81.4±2.6      77.1±3.7 †    83.5±3.9 ○
lymphography     76.5±2.7   78.0±2.2      76.8±2.7 †    76.1±2.4
mushroom        100.0±0.0  100.0±0.0 •†   99.9±0.0     100.0±0.0
pima-indians     74.0±0.5   74.2±1.2 †    75.5±0.9 ○†   75.2±1.1 ○
primary-tumor    41.7±1.3   40.1±1.7 •    28.7±2.5 •    38.5±0.8 •
segment          96.6±0.4   96.1±0.3 •†   96.3±0.4 †    95.2±0.5 •
sick             98.6±0.1   98.4±0.2 •    98.4±0.1 •    98.3±0.2 •
sonar            76.5±2.3   74.4±2.9 †    75.3±2.2 †    75.7±1.9
soybean          91.4±0.5   91.9±0.7      92.2±0.6 †    92.0±0.4
splice           92.5±0.4   93.4±0.3 ○    94.3±0.3 ○    93.4±0.2 ○
vehicle          72.4±0.8   72.9±0.9      72.4±0.8 †    69.0±0.6 •
vote             95.9±0.6   95.9±0.6 †    96.0±0.6 †    95.6±0.3
vowel            78.1±1.1   77.9±1.3 †    79.9±1.2      69.6±1.9 •
waveform-noise   78.0±0.5   76.3±0.4 •    79.4±0.5 ○    79.1±0.6 ○
zoo              92.2±1.2   90.9±1.2 •†   91.5±1.2 †    87.8±2.4 •
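The significance criterion behind the marks in Table 2, a paired two-sided t-test at the 1% level over the ten cross-validation runs, can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the critical value 3.250 (Student's t, 9 degrees of freedom, two-sided 1%) is hard-coded on the assumption of exactly ten paired estimates.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF9_1PCT = 3.250  # two-sided 1% critical value of Student's t, 9 d.f.


def significantly_different(acc_a, acc_b, t_crit=T_CRIT_DF9_1PCT):
    """Paired two-sided t-test on per-run accuracy estimates.

    acc_a[i] and acc_b[i] are the estimates of two schemes from the
    same cross-validation run (ten runs assumed for the default t_crit).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = stdev(diffs)
    if sd == 0:
        # All paired differences identical: different iff non-zero.
        return mean(diffs) != 0
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return abs(t) > t_crit
```

Pairing by run matters here: because the same folds are used for each scheme, the test looks at the run-by-run differences rather than at the two sets of accuracies separately.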

We observe from Table 3 that PART outperforms C4.5 on nine
datasets, whereas C4.5 outperforms PART on six. The chance
probability of this distribution is 0.3 according to a sign test:
thus there is only very weak evidence that PART outperforms C4.5 on
a collection of datasets similar to the one we used. According to
Table 3, PART is significantly less accurate than C5.0 on ten
datasets and significantly more accurate on six. The corresponding
probability for this distribution is 0.23, providing only weak
evidence that C5.0 performs better than PART. For RIPPER the
situation is different: PART outperforms it on fourteen datasets
and performs worse on six. The probability for this distribution is
0.06, a value that provides fairly strong evidence that PART
outperforms RIPPER on a collection of datasets of this type.
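The sign-test probabilities quoted above can be checked with a few lines of code. The sketch below assumes a one-sided test that ignores ties (the text does not say which variant was used, but this one appears to reproduce the reported values); the function name is hypothetical.

```python
from math import comb


def sign_test_p(wins, losses):
    """One-sided sign test: probability of at least max(wins, losses)
    successes in wins + losses fair coin flips (ties are ignored)."""
    n = wins + losses
    k = max(wins, losses)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


for label, wins, losses in [("PART vs C4.5", 9, 6),
                            ("C5.0 vs PART", 10, 6),
                            ("PART vs RIPPER", 14, 6)]:
    print(f"{label}: {sign_test_p(wins, losses):.2f}")
# PART vs C4.5: 0.30
# C5.0 vs PART: 0.23
# PART vs RIPPER: 0.06
```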

Table 3: Results of paired t-tests (p=0.01): number indicates how
often method in column significantly outperforms method in row

          PART   C4.5   C5.0   RIPPER
PART         -      6     10        6
C4.5         9      -      9        4
C5.0         6      5      -        4
RIPPER      14     10     12        -

As well as accuracy, the size of a rule set is important because it
has a strong influence on comprehensibility. The † marks in Table 2
give information about the relative size of the rule sets produced:
they mark learning schemes and datasets for which, on average, PART
generates fewer rules (this never occurs for RIPPER). Compared to
C4.5 and C5.0, the average number of rules generated by PART is
smaller for eighteen datasets and larger for sixteen.

5  Conclusions 

This paper has presented a simple, yet surprisingly effective,
method for learning decision lists based on the repeated generation
of partial decision trees in a separate-and-conquer manner. The
main advantage of PART over the other schemes discussed is not
performance but simplicity: by combining two paradigms of rule
learning it produces good rule sets without any need for global
optimization. Despite this simplicity, the method produces rule
sets that compare favorably with those generated by C4.5 and C5.0,
and are more accurate (though larger) than those produced by
RIPPER.

An interesting question for future research is whether the size of
the rule sets obtained by our method can be decreased by employing
a stopping criterion based on the minimum description length
principle, as is done in RIPPER, or by using reduced error pruning
instead of pessimistic pruning.

Abstract

Most techniques for attribute selection in decision trees are
biased towards attributes with many values, and several ad hoc
solutions to this problem have appeared in the machine learning
literature. Statistical tests for the existence of an association
with a prespecified significance level provide a well-founded basis
for addressing the problem. However, many statistical tests are
computed from a chi-squared distribution, which is only a valid
approximation to the actual distribution in the large-sample case,
and this patently does not hold near the leaves of a decision tree.
An exception is the class of permutation tests. We describe how
permutation tests can be applied to this problem. We choose one
such test for further exploration, and give a novel two-stage
method for applying it to select attributes in a decision tree.
Results on practical datasets compare favorably with other methods
that also adopt a pre-pruning strategy.

1  Introduction 

Statistical tests provide a set of theoretically well-founded tools
for testing hypotheses about relationships in a set of data. One
pertinent hypothesis, when selecting attributes for a decision
tree, is whether there is a significant association between an
attribute’s values and the classes. With $r$ attribute values and
$c$ classes, this equates to testing for independence in the
corresponding $r \times c$ contingency table (White & Liu, 1994),
and statistical tests designed for this purpose can be applied
directly. Unlike most commonly-used attribute selection criteria,
such tests are not biased towards attributes with many values,
which is important because it prevents the decision tree induction
algorithm from selecting splits that overfit the training data by
being too fine-grained.

Statistical tests are based on probabilities derived from the
distribution of a test statistic. Two popular test statistics for
assessing independence in a contingency table have been proposed
for attribute selection: the chi-squared statistic $\chi^2$ and the
log likelihood ratio $G^2$ (White & Liu, 1994). For large samples,
both are distributed according to the chi-squared distribution. But
this is not the case for small samples (Agresti, 1990), and small
samples inevitably occur close to the leaves in a decision tree.
Thus it is inadvisable to use probabilities derived using the
chi-squared distribution for decision tree induction.

Fortunately, there is an alternative that does apply in small
frequency domains. In statistical tests known as “permutation
tests” (Good, 1994), the distribution of the statistic of interest
is calculated directly instead of relying on the chi-squared
approximation; in other words they are “non-parametric” rather than
“parametric.” Such tests do not suffer from the small expected
frequency problem because they do not use the chi-squared
approximation.

This paper describes the application of permutation tests to
attribute selection in a decision tree. We examine one such test,
the Freeman and Halton test, in detail by performing experiments on
artificial and practical datasets: the results show that this
method is indeed preferable to a test that assumes the chi-squared
distribution. The statistic of the Freeman and Halton test is the
exact probability $p_f$ of a contingency table $f$ given its
marginal totals (Good, 1994). Recently, Martin (1997) investigated
the use of this statistic, $p_f$, directly for attribute selection.
We show that results can be improved by using it in conjunction
with the Freeman and Halton test.

Section 2 introduces the idea of permutation tests and how they can
be used to test significance in a contingency table. In Section 2.2
we describe the Freeman and Halton test. The test is expensive, but
simple computational economies are described in Section 2.3.
Section 2.4 describes a novel two-stage method, based on these
ideas, for selecting attributes in a decision tree. Section 3
presents experimental results on artificial and standard datasets.
We verify that the Freeman and Halton test does not prefer
attributes with many values, whereas the test statistic $p_f$ by
itself is biased. We also verify that the parametric version of the
chi-squared test is biased in small-frequency domains. Finally, we
demonstrate that good results are obtained when the new method is
applied to decision-tree building. Section 4 reviews existing work
on using statistical tests for contingency tables in machine
learning, while Section 5 contains some concluding remarks.

2  A Permutation Test and its Application to Attribute Selection

The procedure for permutation tests is simple (Good, 1994). First,
a test statistic is chosen that measures the strength of the effect
being investigated, and is computed over the data. The null
hypothesis is that the observed strength of the effect is not
significant. Next, the labels of the original data are permuted and
the same statistic is calculated for the relabeled data; this is
repeated for all possible permutations of labels. The idea is to
ascertain the likelihood of an effect of the same or greater
strength being observed fortuitously on randomly labeled data with
identical marginal properties. Third, the test statistic’s value
for the original data is compared with the values obtained over all
permutations, by calculating the percentage of the latter that are
at least as extreme as the former. This percentage constitutes the
significance level at which the null hypothesis can be rejected, in
other words, the level at which the observed strength of the effect
can be considered significant.

2.1  Permutation Tests for Contingency Tables

Contingency tables summarize the observed relationship between two
categorical response variables. Several different statistics can be
used to measure the strength of the dependency between two
variables (Good, 1994), the two most common being the chi-squared
statistic $\chi^2$ and the log likelihood ratio $G^2$.

The standard tests using these statistics are based on the fact
that the sampling distribution of both statistics is
well-approximated by the chi-squared distribution. They calculate
the significance level directly from that distribution.

Unfortunately, as noted in the introduction, the chi-squared
distribution assumption is only valid for either statistic when the
sample size is large enough. The chi-squared distribution
approximates the true sampling distribution poorly if the sample
size is small (or the samples are distributed unevenly in the
contingency table). In a decision tree the sample size becomes
smaller and smaller and the distribution of the samples more and
more skewed the closer one gets to the leaves of the tree. Thus one
cannot justify using a test based on the chi-squared approximation
for significance testing throughout a decision tree (although one
might at the upper levels where samples are large). Permutation
tests offer a theoretically sound alternative that is admissible
for any sample size.

The standard permutation test for $r \times c$ contingency tables,
which we have also chosen to employ for this paper, is based on the
statistic $p_f$, the exact probability of a contingency table given
its marginal totals. It is known as the “Freeman and Halton” test
and it is a generalization of Fisher’s exact test for $2 \times 2$
tables (Good, 1994). However, we emphasize that other test
statistics could equally well be used, thereby obtaining exact,
non-parametric versions of conventional parametric tests that are
valid in small-frequency domains (Good, 1994).¹

¹We have also used a permutation test based on $\chi^2$ instead of
on $p_f$, in all the experiments described in Section 3, and
obtained almost identical results.

2.2  Testing the Significance of an Attribute

For attribute selection, we seek to test whether there is a
significant association between an attribute’s values and the class
values. With $r$ attribute values and $c$ classes, this is the same
as testing for independence in the corresponding $r \times c$
contingency table (White & Liu, 1994).

If the $r \times c$ contingency table $f$ contains the frequencies
$f_{ij}$ with column marginals $f_{\cdot j}$ and row marginals
$f_{i\cdot}$, the probability $p_f$ of this table is given by

$$ p_f = \frac{\prod_{i=1}^{r} f_{i\cdot}! \; \prod_{j=1}^{c} f_{\cdot j}!}{f_{\cdot\cdot}! \; \prod_{i=1}^{r} \prod_{j=1}^{c} f_{ij}!}, $$

where $f_{\cdot\cdot}$ is the total number of instances.

Permuting the instances’ class labels does not affect the row and
column totals, and therefore the set of all permutations of the
class labels corresponds to the set of all contingency tables with
the same row and column totals. The p-value of the Freeman and
Halton test is the proportion $p$ of permutations that yield a
table whose probability $p_f$ is less than or equal to the
probability $p_o$ of the original table:

$$ p = \sum_{f} I(p_f \le p_o)\, p_f, $$

where $I(\cdot)$ denotes the indicator function and the sum runs
over all tables with the given marginal totals. The function
computing $p_f$ is known as a multiple hypergeometric distribution
(Agresti, 1990). The resulting value of $p$ is simply compared with
a prespecified desired significance level.
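For very small samples the exact test can be carried out by brute force. The sketch below enumerates every permutation of the class labels, which is only feasible for a handful of instances; it is an illustration under that assumption, and the helper names (`table_from_labels`, `exact_probability`, `freeman_halton_p`) are hypothetical rather than taken from the paper.

```python
from itertools import permutations
from math import factorial, prod


def table_from_labels(attr_values, class_labels):
    """Cross-tabulate attribute values (rows) against class labels (columns)."""
    rows = sorted(set(attr_values))
    cols = sorted(set(class_labels))
    table = [[0] * len(cols) for _ in rows]
    for a, c in zip(attr_values, class_labels):
        table[rows.index(a)][cols.index(c)] += 1
    return table


def exact_probability(table):
    """p_f: multiple hypergeometric probability of a table given its margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    numerator = prod(factorial(t) for t in row_totals + col_totals)
    denominator = factorial(n) * prod(factorial(x) for row in table for x in row)
    return numerator / denominator


def freeman_halton_p(attr_values, class_labels):
    """Exact p-value: share of label permutations whose table is no more
    probable than the observed one (small tolerance for rounding)."""
    p_o = exact_probability(table_from_labels(attr_values, class_labels))
    hits = total = 0
    for perm in permutations(class_labels):
        p_f = exact_probability(table_from_labels(attr_values, perm))
        hits += p_f <= p_o * (1 + 1e-9)
        total += 1
    return hits / total


attr = [0, 0, 0, 1, 1, 1]
cls = ["a", "a", "a", "b", "b", "b"]
print(exact_probability(table_from_labels(attr, cls)))  # 0.05
print(freeman_halton_p(attr, cls))                      # 0.1
```

For this 2 x 2 example the result coincides with the two-sided Fisher exact test, as expected from the generalization described above.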

2.3  Approximating  the  Exact  Test 

Exact computation of the p-value of a permutation test is only
possible for sparsely populated tables, and is computationally
infeasible for most tables resulting from practical machine
learning datasets. Fortunately, $p$ can be approximated to
arbitrary precision by Monte Carlo sampling as follows (Good,
1994).

For each of $n$ trials the class labels are randomly permuted, the
test statistic is computed, and its value is compared to the value
for the original (unpermuted) data. The proportion of trials for
which the artificially generated value is less than or equal to the
original value constitutes an estimate $\hat{p}$ of the exact
significance level $p$. This estimate is a binomial random variable
with standard error $\mathrm{se}(\hat{p}) = \sqrt{\hat{p}(1 - \hat{p})/n}$,
and so its $100(1 - \alpha)\%$ confidence interval is
$\hat{p} \pm t_{n-1}(\alpha/2)\,\mathrm{se}(\hat{p})$, where
$t_{n-1}(\alpha/2)$ is obtained from Student’s t-distribution.

This information is used to decide when to stop performing trials.
Let $p_{\text{fixed}}$ be the prespecified desired minimum
significance level that an attribute must achieve unless it is to
be considered independent of the class, that is, the level at which
the null hypothesis of “no significant dependence” is to be
rejected. Then, with probability $(1 - \alpha)$,

$$ p \le p_{\text{fixed}} \quad \text{if} \quad p_{\text{fixed}} \ge \hat{p} + t_{n-1}(\alpha)\,\mathrm{se}(\hat{p}) $$

and

$$ p \ge p_{\text{fixed}} \quad \text{if} \quad p_{\text{fixed}} \le \hat{p} - t_{n-1}(\alpha)\,\mathrm{se}(\hat{p}). $$

If the first inequality holds we judge the attribute to be
significant; if the second holds we do not.² As $n$ increases, the
likelihood that one of the two inequalities will be true increases,
but if $p$ is very close to $p_{\text{fixed}}$, neither inequality
will become true in a reasonable amount of time. Therefore the
procedure is terminated when the number of trials reaches a
prespecified maximum,³ and any attribute that survives this number
of trials is considered significant. The introduction of this
cut-off point slightly increases the probability that an attribute
is incorrectly judged to be significant.
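A minimal sketch of the Monte Carlo approximation with this stopping rule is given below. It is not the authors' implementation. Trials stop as soon as $p_{\text{fixed}}$ lies above the upper or below the lower one-sided confidence limit of the estimate. Two assumptions are made for brevity: the Student's t quantile is replaced by the standard normal quantile (close for 100 or more trials), and the default `p_fixed` of 0.01 is simply one of the levels used in the experiments. The defaults for `alpha`, `min_trials` and `max_trials` follow the values quoted in the footnotes; all function names are hypothetical.

```python
import random
from math import lgamma, sqrt
from statistics import NormalDist


def log_pf(attr_values, class_labels):
    """Logarithm of the exact table probability p_f."""
    cells, rows, cols = {}, {}, {}
    for a, c in zip(attr_values, class_labels):
        cells[(a, c)] = cells.get((a, c), 0) + 1
        rows[a] = rows.get(a, 0) + 1
        cols[c] = cols.get(c, 0) + 1
    log_fact = lambda k: lgamma(k + 1)
    return (sum(map(log_fact, rows.values()))
            + sum(map(log_fact, cols.values()))
            - log_fact(len(attr_values))
            - sum(map(log_fact, cells.values())))


def monte_carlo_significant(attr_values, class_labels, p_fixed=0.01,
                            alpha=0.005, min_trials=100, max_trials=1000,
                            rng=None):
    """Return True if the attribute is judged significant at level p_fixed."""
    rng = rng or random.Random(0)
    z = NormalDist().inv_cdf(1 - alpha)  # normal stand-in for t_{n-1}(alpha)
    log_po = log_pf(attr_values, class_labels)
    labels = list(class_labels)
    hits = 0
    for n in range(1, max_trials + 1):
        rng.shuffle(labels)  # random permutation of the class labels
        hits += log_pf(attr_values, labels) <= log_po + 1e-9
        if n < min_trials:
            continue
        p_hat = hits / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_fixed >= p_hat + margin:
            return True   # whole interval below p_fixed: significant
        if p_fixed <= p_hat - margin:
            return False  # whole interval above p_fixed: not significant
    return True  # survived max_trials: treated as significant


perfect = [0] * 20 + [1] * 20
print(monte_carlo_significant(perfect, perfect))               # True
print(monte_carlo_significant([0, 1] * 20, [0, 0, 1, 1] * 10))  # False
```

Working with log-probabilities avoids underflow for large tables, where $p_f$ itself can be far smaller than the smallest representable float.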

2.4  Procedure  for  Attribute  Selection 

At each node of a decision tree we must decide which attribute to
split on. This is done in two steps. First, attributes are rejected
if they show no significant association to the class according to a
pre-specified significance level. To judge “significance” we employ
the Freeman and Halton test, approximated by Monte Carlo sampling
as described above. Second, from the attributes that remain, the
one with the lowest value of $p_f$ is chosen.⁴ The selected
attribute is then used to split the set of instances, and the
algorithm recurses.

The division into two steps is a crucial part of the procedure. It
distinguishes clearly between the different concepts of
significance and strength. For example, it is well known that the
association between two distributions may be very significant even
if that association is weak, provided the quantity of data is large
enough (Press, Teukolsky, Vetterling & Flannery, 1988, p. 628).
First, we test the significance of an association using a
permutation test (specifically, the Freeman and Halton test); then
we consider its strength (as measured by the exact probability
$p_f$).

If no significant attributes are found in the first step, the
splitting process stops and the subtree is not expanded any
further. This gives an elegant, uniform technique for pre-pruning.
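The two-step choice at a node can be sketched as follows, assuming the p-value and the exact probability of each candidate attribute have already been computed. The function name, the tuple layout and the numbers in the example are hypothetical.

```python
def select_attribute(candidates, level=0.01):
    """Two-stage selection.

    candidates: list of (name, p_value, p_f) tuples, where p_value comes
    from the significance test and p_f is the exact table probability.
    Returns the chosen attribute name, or None to stop splitting.
    """
    # Step 1: discard attributes with no significant association.
    significant = [c for c in candidates if c[1] <= level]
    if not significant:
        return None  # pre-pruning: the subtree is not expanded
    # Step 2: among the survivors, pick the lowest exact probability.
    return min(significant, key=lambda c: c[2])[0]


candidates = [("many_valued_noise", 0.20, 1e-9),
              ("weak_signal", 0.004, 1e-3),
              ("strong_signal", 0.001, 1e-5)]
print(select_attribute(candidates))  # strong_signal
```

In the example the first attribute has the lowest $p_f$ but fails the significance test, so it never reaches the second step; this is the situation the separation of significance and strength is meant to handle.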

²Here, $\alpha$ is used instead of $\alpha/2$ because the
comparisons are one-sided. In our experiments we set $\alpha$ to
0.005.

³We use at least 100 and at most 1000 trials in our experiments.

⁴Other attribute selection criteria could be employed at this
stage; $p_f$ was chosen to allow a direct comparison with the
method proposed by Martin (1997).

3  Experimental Results

We begin with two controlled experiments that are designed to
verify the relative performance of (a) the use of the
exact-probability $p_f$ statistic in the Freeman and Halton test,
(b) the use of $p_f$ by itself with no significance test (Martin,
1997), and (c) the use of the parametric version of the chi-squared
test, that is, the probability of $\chi^2$ calculated from the
chi-squared distribution (White & Liu, 1994). The first experiment
exhibits an artificial dataset for which method (b) performs poorly
because it is biased towards many-valued attributes, whereas (a)
performs well (and so does (c)). The second exhibits another
dataset for which method (c) is biased towards many-valued
attributes and performs poorly (and (b) performs even worse),
whereas (a) continues to perform well.

The third subsection presents results for building decision trees
on practical datasets using the new method.

Table 1: Average probabilities for random data (600 instances; uniformly distributed attribute values)

Attribute Values  Class Values  (a) $\hat{p}$  (b) $p_f$  (c) $p_\chi$
2                 2             0.525          0.045      0.488
2                 5             0.511          1.63e-05   0.509
2                 10            0.506          1.80e-10   0.505
5                 2             0.497          1.55e-05   0.496
5                 5             0.500          9.25e-18   0.497
5                 10            0.491          6.62e-35   0.487
10                2             0.498          1.77e-10   0.495
10                5             0.520          7.84e-35   0.515
10                10            0.512          4.89e-68   0.503

Table 2: Average probabilities for random data (20 instances; non-uniformly distributed attribute values)

Attribute Values  Class Values  (a) $\hat{p}$  (b) $p_f$  (c) $p_\chi$
2                 2             0.745          0.285      0.515
2                 5             0.674          0.024      0.466
2                 10            0.741          0.004      0.446
5                 2             0.549          0.027      0.444
5                 5             0.561          1.02e-4    0.448
5                 10            0.632          1.80e-6    0.418
10                2             0.548          0.004      0.430
10                5             0.581          1.72e-6    0.425
10                10            0.639          1.42e-8    0.382

3.1  Using the Exact Probability $p_f$ is Biased

In order to show that the exact probability $p_f$ is biased towards
attributes with many values, we adopt the experimental setup of
White and Liu (1994). This involves an artificial dataset that
exhibits no actual association between class and attribute values.
For each class, an equal number (300) of instances with random,
uniformly distributed attribute values is generated. The estimated
p-value of the Freeman and Halton test $\hat{p}$, the exact
probability $p_f$, and the p-value of the parametric chi-squared
test $p_\chi$ are calculated for this artificial, non-informative,
attribute.⁵ This procedure is repeated 1000 times with different
random seeds used to generate the instances.

Table 1 shows the average values obtained. It can be seen in column
(b) that $p_f$ systematically decreases with increasing number of
classes and attribute values. Even more importantly, it is always
close to zero. If used for pre-pruning at the 0.01 level (as
proposed by Martin, 1997), it would fail to stop splitting in every
situation except that represented by the first row. On the other
hand, neither $\hat{p}$ nor $p_\chi$ varies systematically with the
number of attribute and class values. For these reasons it is
inadvisable to use $p_f$ for attribute selection without preceding
it with a significance test.
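The dependence of $p_f$ on table size can be seen without any random sampling. The sketch below (an illustration, not the experiment reported in Table 1) evaluates $p_f$ for two perfectly balanced tables of 600 instances, that is, tables showing exactly the counts that independence predicts. The table contents and the helper name `log_pf` are assumptions made for this example.

```python
from math import exp, lgamma


def log_pf(table):
    """Logarithm of the exact probability p_f of a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    log_fact = lambda k: lgamma(k + 1)
    return (sum(map(log_fact, row_totals))
            + sum(map(log_fact, col_totals))
            - log_fact(sum(row_totals))
            - sum(log_fact(x) for row in table for x in row))


balanced_2x2 = [[150, 150], [150, 150]]          # 2 values, 2 classes
balanced_10x10 = [[6] * 10 for _ in range(10)]   # 10 values, 10 classes

for name, table in [("2 x 2", balanced_2x2), ("10 x 10", balanced_10x10)]:
    print(name, exp(log_pf(table)))
```

Even though neither table contains any association, the larger table has a probability that is smaller by many orders of magnitude, simply because the probability mass is spread over far more possible tables. A fixed threshold on $p_f$ therefore favors many-valued attributes.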

3.2  Parametric  Chi-Squared  Test  is  Biased 

A similar experimental procedure was used to show that the
parametric chi-squared test is biased in small frequency domains
with unevenly distributed samples. Instead of generating the
attribute values uniformly, they are skewed so that more samples
lie close to the zero point. This is done using the distribution
[…], where $k$ is the number of attribute values and $x$ is
distributed uniformly between 0 and 1. The number of instances is
reduced to twenty.

⁵Our experiments use N = 1000 Monte Carlo trials to estimate $p$.

Table 2 shows the average values obtained using this procedure. It
can be seen that $p_\chi$ decreases systematically as the number of
attribute values increases, whereas this is not the case for
$\hat{p}$. The test based on $p_\chi$ is too liberal in this
situation. There also exist situations in which it is too
conservative (Good, 1994). If used for pruning in a decision tree,
a test that is too liberal does not prune enough, and a test that
is too conservative prunes too much.

3.3  Comparison  on  Practical  Datasets 

Results are now presented for building decision trees for
thirty-one UCI datasets (Merz & Murphy, 1996) using the method
described above. We eliminated missing values from the datasets by
deleting all attributes with more than 10% missing values, and
subsequently removing all instances with missing values. The
resulting datasets are summarized in Table 3. All numeric
attributes were discretized into four intervals of equal width.⁶

We compare pre-pruned trees built using (a) $p_f$ with prior
significance testing using the Freeman and Halton test $\hat{p}$,
(b) the exact probability $p_f$, (c) $p_f$ with prior significance
testing using the parametric chi-squared test $p_\chi$, and (d)
post-pruned trees built using C4.5’s pessimistic pruning with
default parameter settings (Quinlan, 1993). We also include results
for pruned and unpruned trees as built by C4.5. Note that for (a)
and (c) we are now applying the two-step attribute selection
procedure developed in Section 2.4, first discarding insignificant
attributes and then selecting the best among the remainder. Results
are reported for three significance levels: 0.01, 0.05 and 0.10.
All results were generated using ten-fold cross-validation repeated
ten times with different randomizations of the dataset. The same
folds were used for each scheme.⁷

Table 4 shows how method (a) compares with the others.
Each row contains the number of datasets for
which it builds significantly more (+) or less (-)
accurate trees, and significantly smaller (+) or larger (-)
trees than the method associated with this row. We
speak of results being “significantly different” if the

⁶If the class information were used when discretizing the
attributes, the assumptions of the statistical tests would be
invalidated.

⁷Appendix A lists the average accuracy and standard
deviation for a representative subset of the methods.


Table 3: Datasets used for the experiments

Dataset          Size   Attributes        Classes
                        (numeric/total)
anneal            898    6/38              5
audiology         216    0/67             24
australian        653    6/15              2
autos             193   14/24              6
balance-scale     625    4/4               3
breast-cancer     277    0/9               2
breast-w          683    9/9               2
german           1000    7/20              2
glass (G2)        163    9/9               2
glass             214    9/9               6
heart-c           296    6/13              2
heart-h           261    5/10              2
heart-statlog     270   13/13              2
hepatitis         137    3/16              2
hypothyroid      3404    2/24              4
ionosphere        351   34/34              2
iris              150    4/4               3
kr-vs-kp         3196    0/36              2
lymphography      148    3/18              4
mushroom         8124    0/21              2
pima-indians      768    8/8               2
primary-tumor     336    0/15             21
segment          2310   19/19              7
sick             3404    2/24              2
sonar             208   60/60              2
soybean           630    0/16             15
splice           3190    0/61              3
vehicle           846   18/18              4
vote              312    0/15              2
vowel             990   10/13             11
zoo               101    1/16              7

difference is statistically significant at the 1% level
according to a paired two-sided t-test, each pair of data
points consisting of the estimates obtained in one
ten-fold cross-validation run for the two learning schemes
being compared. Results are shown for three different
significance levels: note that this refers to the level
at which attributes are rejected prior to the selection
process.
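The comparison rule can be illustrated with a small paired t-test on ten per-run estimates. This is only a sketch under stated assumptions: the function names are hypothetical, and the constant 3.250 is the standard two-sided 1% critical value of Student's t distribution with 9 degrees of freedom (ten paired runs), hard-coded so that the example needs no external library.

```python
import math
from typing import Sequence

T_CRIT_1PCT_DF9 = 3.250  # two-sided 1% critical value, 9 degrees of freedom


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0.0:  # identical differences in every run
        return math.inf if mean > 0 else -math.inf if mean < 0 else 0.0
    return mean / math.sqrt(var / n)


def compare(a: Sequence[float], b: Sequence[float],
            t_crit: float = T_CRIT_1PCT_DF9) -> str:
    """Return '+' if a is significantly better, '-' if worse, '0' otherwise."""
    t = paired_t_statistic(a, b)
    if t > t_crit:
        return "+"
    if t < -t_crit:
        return "-"
    return "0"
```

Here each element of `a` and `b` would be the accuracy (or tree size, with the sign convention reversed) obtained by one scheme in one ten-fold cross-validation run; using the same folds for both schemes is what justifies pairing the values.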

Observe first that pre-pruning using p outperforms
pre-pruning using $p_f$ (the three rows marked (b)),
confirming our findings from Section 3.1. For all three
significance levels p dominates $p_f$ in both accuracy and
size of the trees produced. These results show that if
the splitting attribute is selected based on the value of
$p_f$, it is better to use a significance test first.

One might think that $p_f$ performs poorly with respect
to p because the former does not prune sufficiently:
it is inferior in terms of both accuracy and tree size.
Consequently, we also ran pre-pruning using $p_f$ at the
0.005 and 0.001 levels, and found that the performance
Table 4: Number of times p performs significantly better (+) or worse (-) than (b) $p_f$, (c) $p_\chi$, (d) post-pruned
trees, and pruned and unpruned C4.5 trees with respect to accuracy and tree size

                        Accuracy      Tree Size
p                        +     -       +     -
$p_{fixed}$ = 0.01
  (b) $p_f$              8     5      17     6
  (c) $p_\chi$           9     3       8    11
  (d) post-pruned        4    14      20     7
  C4.5 pruned            3    17      20     7
  C4.5 unpruned         11    11      31     0
$p_{fixed}$ = 0.05
  (b) $p_f$              8     2      22     3
  (c) $p_\chi$           6     6      24     2
  (d) post-pruned        4     9       8    17
  C4.5 pruned            2    16      11    15
  C4.5 unpruned          8     9      29     2
$p_{fixed}$ = 0.1
  (b) $p_f$              9     2      24     1
  (c) $p_\chi$           5     5      24     0
  (d) post-pruned        4    12       5    22
  C4.5 pruned            3    16       3    24
  C4.5 unpruned          8     8      29     2


Table 5: Number of times p with gain ratio (Method a') performs significantly better (+) or worse (-) than p
with $p_f$ (Method a), and pruned and unpruned C4.5 trees

                        Accuracy      Tree Size
p with gain ratio        +     -       +     -
$p_{fixed}$ = 0.01
  p with $p_f$           8     3      10    10
  C4.5 pruned            3    14      21     6
  C4.5 unpruned         13     7      30     1
$p_{fixed}$ = 0.05
  p with $p_f$          10     4      11    14
  C4.5 pruned            0    10      10    14
  C4.5 unpruned         12     7      30     1
$p_{fixed}$ = 0.1
  p with $p_f$          10     5      11    12
  C4.5 pruned            1    15       6    22
  C4.5 unpruned         13     8      30     0


difference between $p_f$ and p cannot be eliminated by
adjusting the significance level.

Next, observe from the three rows marked (c) that for
the 0.01 significance level, pre-pruning using p beats
pre-pruning using $p_\chi$ with respect to the accuracy
of the resulting trees. For this significance level the
two methods produce trees of similar size. However,
for both the 0.05 and the 0.1 levels p produces trees
that are significantly smaller than those produced by
$p_\chi$. For these two significance levels the two methods
perform comparably as far as accuracy is concerned.
These facts indicate that for both the 0.05 and the
0.1 levels $p_\chi$ is a more liberal test than p if applied
to attribute selection and pre-pruning; $p_\chi$ stops later
than p, as for the artificial dataset used in Section
3.2. However, it is sometimes more conservative, in
particular for the 0.01 level. The two tests really do
behave differently: they cannot be forced to behave
in the same way by adjusting their significance levels.
However, the results show that trees produced by p are
preferable to those produced by $p_\chi$.

Table 4 also shows that post-pruning consistently
beats pre-pruning using p, so far as accuracy is
concerned (rows marked (d)). Our findings show that all
the investigated pre-pruning methods perform
significantly worse than pessimistic post-pruning.⁸ For both
the 0.01 and the 0.05 levels, there are five datasets


⁸This contradicts a previous result (Martin, 1997) that
trees pre-pruned using $p_f$ are as accurate as, and smaller
than, trees post-pruned using pessimistic pruning.
on which all pre-pruning methods consistently perform
significantly worse than post-pruning: hypothyroid,
kr-vs-kp, sick, splice, and vowel. On kr-vs-kp and
vowel the pre-pruning methods stop too early; on the
other three they stop too late. This means that the
problem cannot be solved by adjusting the significance
level of the pre-pruning methods.

For reference, Table 4 also includes results for pruned
and unpruned decision trees built by C4.5. C4.5’s
method for building pruned trees differs from
post-pruning method (d) only in that it employs the gain
ratio⁹ instead of $p_f$ for attribute selection.

Surprisingly, Table 4 shows that p does not perform
better than C4.5’s unpruned trees as far as accuracy is
concerned, although p performs better than unpruned
trees built using $p_f$ (results not shown). This indicates
that the gain ratio produces more accurate trees than
$p_f$. We therefore replaced attribute selection using $p_f$
in the second step of pre-pruning method (a) by
selection based on the gain ratio. As Table 5 shows, the new
method (a'), selection based on the gain ratio with
prior significance testing using the Freeman and Halton
test p, indeed performs better than method (a),
and it also outperforms C4.5’s unpruned trees.
However, as Table 5 also shows, post-pruning, in this case
represented by C4.5’s pruned trees, still consistently
beats pre-pruning using p.

4  Related  Work 

Several researchers have applied parametric statistical
tests to attribute selection in decision trees (White &
Liu, 1994; Kononenko, 1995) and proposed remedies
for their shortcomings (Martin, 1997). These are
reviewed in the next section. Following that we discuss
work on permutation tests for machine learning, none
of which has been concerned with attribute selection
in decision trees.

4.1  Use  of  Statistical  Tests  for  Attribute 
Selection 

White and Liu (1994) compare several entropy-based
selection criteria to parametric tests that rely on the
chi-squared distribution. More specifically, they
compare the entropy-based measures to parametric tests
based on both the chi-squared and log likelihood
ratio statistics. They conclude that each of the entropy

⁹More precisely, it selects the attribute with maximum
gain ratio among the attributes with more than average
information gain.


measures favors attributes with larger numbers of
values, whereas the statistical tests do not suffer from this
problem. However, they also mention the problem of
small expected frequencies with parametric tests and
suggest the use of Fisher’s exact test as a remedy. The
extension of Fisher’s exact test to r x c tables is the
Freeman and Halton test that we have used above.
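For concreteness, the exact probability of a contingency table given its marginal totals, and a permutation-style approximation of the Freeman and Halton test built on it, might be sketched as below. The sampling scheme (randomly permuting the class labels a fixed number of times) and all function names are assumptions made for this illustration; the source does not spell out here how its approximation is computed.

```python
import math
import random
from typing import List, Sequence


def log_table_probability(table: Sequence[Sequence[int]]) -> float:
    """log of the probability of an r x c table given its marginal totals
    (multiple hypergeometric distribution)."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)

    def log_fact(k: int) -> float:
        return math.lgamma(k + 1)

    return (sum(log_fact(x) for x in row_totals)
            + sum(log_fact(x) for x in col_totals)
            - log_fact(n)
            - sum(log_fact(x) for r in table for x in r))


def contingency_table(attr: Sequence[str], cls: Sequence[str]) -> List[List[int]]:
    """Cross-tabulate attribute values (rows) against class values (columns)."""
    a_idx = {v: i for i, v in enumerate(sorted(set(attr)))}
    c_idx = {v: i for i, v in enumerate(sorted(set(cls)))}
    table = [[0] * len(c_idx) for _ in a_idx]
    for a, c in zip(attr, cls):
        table[a_idx[a]][c_idx[c]] += 1
    return table


def approx_freeman_halton(attr: Sequence[str], cls: Sequence[str],
                          n_perm: int = 1000, seed: int = 0) -> float:
    """Fraction of label permutations whose table is no more probable than
    the observed one (a Monte Carlo estimate of the test's p-value)."""
    rng = random.Random(seed)
    observed = log_table_probability(contingency_table(attr, cls))
    labels = list(cls)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permuting labels preserves both sets of marginals
        if log_table_probability(contingency_table(attr, labels)) <= observed + 1e-9:
            hits += 1
    return hits / n_perm
```

Working with log-factorials via `math.lgamma` avoids overflow for larger tables, and no assumption about expected cell frequencies is needed, which is the property that motivates the exact test in small-frequency settings.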

Kononenko (1995) repeated and extended these
experiments and investigated several other attribute
selection criteria as well. He shows that the parametric test
based on the log likelihood ratio is biased towards
attributes with many values if the number of classes and
attribute values relative to the number of instances
exceed the corresponding figures considered by White
and Liu (1994). This is not surprising: it can be traced
to the problem of small expected frequencies. For the
log likelihood ratio the effect is more pronounced than
for the chi-squared statistic (Agresti, 1990).

Kononenko also observes another problem with
statistical tests. The restricted floating-point precision
of most computer arithmetic makes it difficult to use
them to discriminate between different informative
attributes. The reason for this is that the association
to the class is necessarily highly significant for all
informative attributes.¹⁰ However, there is an obvious
solution, which we pursue in this paper: once it has
been established that an attribute is significant, it can
be compared to other significant attributes using an
attribute selection criterion that measures the strength
of the association.
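The two-step idea can be written down in a few lines. In this sketch the p-values and association strengths are assumed to have been computed elsewhere; `select_attribute` and the convention that a larger strength means a stronger association are hypothetical choices made for the illustration.

```python
from typing import Dict, Optional, Tuple


def select_attribute(candidates: Dict[str, Tuple[float, float]],
                     alpha: float = 0.05) -> Optional[str]:
    """candidates maps attribute name -> (p_value, strength).

    Step 1: discard attributes whose test is not significant at level alpha.
    Step 2: among the survivors, return the one with the strongest association.
    Returns None when nothing is significant, i.e. pre-pruning stops here.
    """
    significant = {name: strength
                   for name, (p_value, strength) in candidates.items()
                   if p_value <= alpha}
    if not significant:
        return None
    return max(significant, key=significant.get)
```

Because the significance test is only used as a filter, its p-values never need to be compared with one another, which sidesteps the floating-point precision problem mentioned above.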

Recently, Martin (1997) used the exact probability of
a contingency table given its marginal totals $p_f$ for
attribute selection and pre-pruning. Our method differs
from his only in that we employ a significance test,
based on $p_f$ but not identical to it, to determine the
significance of an attribute before selecting the best of
the significant attributes according to $p_f$. As Section 3
of this paper establishes, direct use of $p_f$ for attribute
selection produces biased results.

4.2  Use  of  Permutation  Tests  in  Machine 
Learning 

Apparently the first to use a permutation test for
machine learning, Gaines (1989) employs an
approximation to Fisher’s exact test to judge the quality of rules
found by the INDUCT rule learner.¹¹ Instead of the

¹⁰The probability that the null hypothesis of no
association between attribute and class values is incorrectly
rejected is very close to zero.

¹¹He uses the one-tailed version of Fisher’s exact test.
Figure 1: Two 2 x 2 tables which both optimize the
test statistic


hypergeometric distribution he uses the binomial
distribution, which is a good approximation if the
sample size is small relative to the population size (smaller
than 10 percent).
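The quality of this approximation is easy to check numerically. The sketch below compares the two probability mass functions for a small and a large sampling fraction; the population sizes are arbitrary illustrative choices, not figures from the paper.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(k successes) when drawing n items without replacement from a
    population of N items that contains K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def max_gap(N: int, K: int, n: int) -> float:
    """Largest pointwise difference between the two distributions."""
    p = K / N
    return max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p))
               for k in range(n + 1))


if __name__ == "__main__":
    print(max_gap(N=1000, K=300, n=20))  # sample is 2% of the population
    print(max_gap(N=40, K=12, n=20))     # sample is 50% of the population
```

With the 2% sample the two distributions nearly coincide, whereas with the 50% sample the hypergeometric distribution is visibly more concentrated, because sampling without replacement reduces the variance.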

Jensen (1992) gives an excellent introduction to
permutation tests.¹² He discusses several alternatives,
points out their weaknesses, and deploys the
methodology in a prototypical rule learner. However, he does
not mention the prime advantage of permutation tests,
which makes them especially interesting in the context
of decision trees: their applicability to small-frequency
domains.

5  Conclusions 

We have applied an approximate permutation test
based on the multiple hypergeometric distribution to
attribute selection and pre-pruning in decision trees,
and explained why it is preferable to tests based on
the chi-squared distribution. We have shown that
using the exact probability of a contingency table given
its marginal totals without a prior significance test is
biased towards attributes with many values and
performs worse in comparison. Although we were able to
improve on existing methods for pre-pruning, we could
not achieve the same accuracy as post-pruning.

Apart from the standard explanation that pre-pruning
misses hidden attribute interactions, there are two
other possible reasons for this result. The first is that
we did not adjust for multiple comparisons when
testing the significance of an attribute. Recently, Jensen
and Schmill (1997) showed how to reduce the size of
a post-pruned tree significantly by taking multiple
hypotheses into account using a technique known as the
“Bonferroni correction.” The second reason is that
tests for r x c contingency tables are inherently
multi-sided. Consider the table shown at the left of
Figure 1, which corresponds to a perfect classification of
two classes using an attribute with two values. There is
another permutation of class labels, shown at the right,
that also results in a contingency table with the same
optimum value of the test statistic. The significance

¹²He uses the term “randomization test” instead of
permutation test.


level achieved by the original table is only half as great
as it would be if there were only one table that
optimized the test statistic. In the case of two attributes
and two classes, the one-sided version of Fisher’s exact
test avoids this problem. Generalizing this to the r x c
case appears to be an open problem.
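The effect can be reproduced on a small balanced 2 x 2 example. The sketch below assumes n instances of each class and n instances of each attribute value, so a table is fully determined by its top-left cell a; the helper names are hypothetical.

```python
from math import comb


def table_prob(a: int, n: int) -> float:
    """Probability of the 2 x 2 table [[a, n-a], [n-a, a]] given that all
    four marginal totals equal n (hypergeometric distribution)."""
    return comb(n, a) * comb(n, n - a) / comb(2 * n, n)


def two_sided_p(a_obs: int, n: int) -> float:
    """Sum the probabilities of all tables no more probable than the observed one."""
    p_obs = table_prob(a_obs, n)
    return sum(table_prob(a, n) for a in range(n + 1)
               if table_prob(a, n) <= p_obs * (1 + 1e-12))


def one_sided_p(a_obs: int, n: int) -> float:
    """Sum only over tables at least as extreme in the observed direction."""
    return sum(table_prob(a, n) for a in range(a_obs, n + 1))


if __name__ == "__main__":
    n = 5
    print(one_sided_p(n, n))  # perfect classification, one direction only
    print(two_sided_p(n, n))  # twice as large: the mirrored table also counts
```

For the perfect table (a = n) the mirrored table (a = 0) has exactly the same probability, so the multi-sided p-value is twice the one-sided value, which makes the attribute harder to accept as significant.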
Table 6: Experimental results: percentage of correct classifications, and standard deviation using p, $p_f$, $p_\chi$,
post-pruned trees, p with gain ratio, C4.5’s pruned trees, and C4.5’s unpruned trees. Because of space constraints,
we could only include results for one of the three $p_{fixed}$ values used in Table 4: we chose $p_{fixed}$ = 0.05. In the last
six columns, figures are marked with • if they are significantly worse than the corresponding results for p, and
with o if they are significantly better.

Dataset         p           $p_f$         $p_\chi$      post-pruned   p with        C4.5          C4.5
                                                                      gain ratio    pruned        unpruned
anneal          98.6±0.1    98.5±0.0 •    99.0±0.1 o    98.4±0.1 •    98.3±0.3      98.0±0.3 •    98.3±0.3
audiology       71.6±1.9    70.3±1.9 •    71.5±1.7      71.9±1.3      73.8±1.2 o    74.8±1.0 o    74.8±1.3 o
australian      85.7±0.5    86.7±0.5 o    85.0±0.5 •    86.4±0.0 o    84.8±0.5 •    85.2±0.4      83.8±1.0 •
autos           67.3±2.2    67.2±2.4      72.7±2.4 o    70.5±2.4 o    73.3±2.3 o    73.0±2.0 o    72.9±2.3 o
balance-scale   66.1±0.9    70.5±1.2 o    65.9±1.2      67.3±1.0      67.2±1.2 o    67.9±1.0 o    74.1±1.0 o
breast-cancer   69.0±1.5    65.0±1.4 •    69.8±1.2      67.6±1.1      72.5±1.1 o    74.4±1.2 o    66.6±1.4 •
breast-w        95.2±0.7    95.1±0.6      95.0±0.7      95.2±0.6      95.7±0.3      96.0±0.3 o    95.6±0.3
german          70.3±0.7    70.4±0.7      70.4±1.1      70.5±0.5      70.5±0.8      70.9±0.8      67.2±1.2 •
glass (G2)      70.5±4.3    70.6±2.5      70.5±3.3      71.3±1.7      67.3±2.5      79.7±1.4 o    79.5±1.6 o
glass           59.8±1.4    59.3±1.4      59.6±1.1      60.2±1.3      60.1±1.6      59.9±2.1      59.3±1.4
heart-c         78.2±1.1    76.8±1.4      76.6±0.9 •    79.2±2.4      77.0±1.2      77.5±1.2      75.1±1.4 •
heart-h         73.9±0.9    72.6±1.6      74.8±1.2      73.7±0.9      77.8±1.2 o    79.5±0.8 o    76.6±1.0 o
heart-statlog   79.2±1.5    77.7±1.7 •    78.1±1.9 •    80.1±0.7      76.2±1.6 •    78.5±1.9      75.7±2.0 •
hepatitis       79.8±2.4    79.5±2.2      79.5±1.7      80.7±1.6      84.4±1.8 o    84.4±1.3 o    80.7±1.4
hypothyroid     91.7±0.1    91.7±0.0      91.7±0.0      91.9±0.0 o    91.7±0.0      91.9±0.0 o    91.7±0.1
ionosphere      87.0±1.0    86.7±0.8      87.4±0.8      88.1±0.5 o    87.8±1.4      87.2±0.6      86.6±0.7
iris            91.8±0.3    91.5±0.9      91.8±0.3      91.5±0.8      91.9±0.2      91.5±0.9      90.7±1.1
kr-vs-kp        99.3±0.1    99.3±0.1      99.3±0.1      99.4±0.1 o    99.3±0.1      99.5±0.1 o    99.5±0.1 o
lymphography    75.2±0.8    76.3±2.1      75.2±1.5      76.0±2.4      76.1±1.6      78.6±1.6 o    75.8±2.0
mushroom        100.0±0.0   100.0±0.0     100.0±0.0     100.0±0.0     100.0±0.0     100.0±0.0     100.0±0.0
pima-indians    74.0±0.8    72.9±0.7      74.2±0.5      71.9±0.4 •    74.1±0.6      74.1±0.5      69.4±0.8 •
primary-tumor   39.8±1.1    36.1±1.4 •    37.6±1.4 •    35.7±1.4 •    38.7±1.9      40.0±0.5      40.3±1.1
segment         91.0±0.2    91.2±0.3      91.1±0.2 o    91.3±0.2      91.5±0.3 o    91.8±0.2 o    91.8±0.3 o
sick            93.3±0.1    93.3±0.1      93.2±0.1 •    93.4±0.0 o    93.3±0.1      93.4±0.0 o    93.2±0.1 •
sonar           68.8±2.5    68.3±2.5      68.6±3.5      69.1±2.4      70.3±2.6      71.5±2.2      70.5±3.1
soybean         75.1±0.8    72.2±0.8 •    76.1±0.7 o    73.5±0.6 •    77.6±0.5 o    77.7±0.5 o    76.7±0.7 o
splice          92.6±0.3    92.3±0.3 •    92.2±0.3 •    93.4±0.2 o    93.2±0.2 o    94.2±0.2 o    92.2±0.2 •
vehicle         63.4±0.9    62.0±0.6 •    64.1±1.0 o    64.2±0.7      65.7±0.7 o    66.1±0.5 o    64.2±0.7
vote            95.4±0.4    95.5±0.4      95.5±0.3      95.6±0.5      95.5±0.4      95.5±0.4      96.2±0.5 o
vowel           77.9±1.0    78.0±1.0      79.5±1.0 o    80.8±1.0 o    73.8±0.6 •    76.6±0.5 •    78.2±0.7
zoo             92.5±1.8    92.8±1.6      94.0±2.0      94.8±2.1 o    89.6±1.4 •    90.8±1.5      91.5±1.4

Abstract

Information extraction (IE) is the problem
of filling out pre-defined structured summaries
from text documents. We are interested
in performing IE in non-traditional
domains, where much of the text is often
ungrammatical, such as electronic bulletin
board posts and Web pages. We suggest that
the best approach is one that takes into account
many different kinds of information,
and argue for the suitability of a multistrategy
approach. We describe learners for IE
drawn from three separate machine learning
paradigms: rote memorization, term-space
text classification, and relational rule induction.
By building regression models mapping
from learner confidence to probability of
correctness and combining probabilities
appropriately, it is possible to improve extraction
accuracy over that achieved by any individual
learner. We describe three different
multistrategy approaches. Experiments on two
IE domains, a collection of electronic seminar
announcements from a university computer
science department and a set of newswire
articles describing corporate acquisitions from
the Reuters collection, demonstrate the
effectiveness of all three approaches.

1  INTRODUCTION 

Information extraction (IE) poses the following
problem: Suppose each document in a collection describes
some entity or event drawn from a semantically
coherent domain. For example, the collection may consist of
newswire articles describing terrorist attacks in Latin
America, or of personal home pages from a university
computer science department. Given a document
from the collection and a set of questions defined for
the domain, find the answer to each question in the
form of a fragment of text from the document. In the
case of articles on terrorism, the object might be to
find the title of the group responsible for the attack,
the instrument of the attack, and the victim’s name;
from home pages, we might seek to extract the owner’s
name, home address, and university affiliation.

There are many possible uses for a successful IE
system. As a front end, an IE system can enable database
mining and knowledge discovery in textual domains,
where such processing would otherwise be limited or
impossible. In hypertext, it can support directed and
efficient automatic navigation. It can serve as a source
of high-quality features for document categorization.
And the output of an IE system can be viewed as a
kind of succinct and directed summarization.

Although traditional IE (Cowie & Lehnert, 1996)
concentrates on domains consisting of grammatical
prose, we are interested in extracting information from
“messy” text, such as Web pages, email, and finger
plan files. Our goal is the development of machine
learning methods for such domains. To perform
well, these methods must be prepared to exploit
non-linguistic information, such as stock phrases,
document formatting, meta-textual structure (e.g., in
HTML), and term frequency statistics.

Several learning IE systems have been proposed which
are also targeted at such domains (Soderland, 1997;
Califf & Mooney, 1997; Kushmerick, 1997). These
previous investigations all take a single approach or
attack a particular kind of domain. However, given
the wealth of information in a typical document and
the difficulty of adequately representing this information
for learning, we surmise that no individual learning
approach is best for all IE problems. An individual
learner embodies biases that make it more suitable for
some kinds of information and aspects of a problem
than for others. A statistical learner like Naive Bayes,
for example, is useful for problems in which each feature
contributes some evidence toward the determination
of class membership, and in which violations of
the independence assumption do not predominate. It
is less suitable for problems involving elaborate feature
sets, in which some features are abstractions or
combinations of others (i.e., where the independence
assumption is directly violated). Symbolic learners,
on the other hand, work quite well for problems with
elaborate feature sets, especially for those classes
expressible in logical terms using a small subset of features.
These considerations suggest a multistrategy
approach.

Multistrategy learning is an attempt to devise systems
which, by employing multiple constituent learners,
typically drawn from diverse paradigms,
achieve performance superior to that of any single learner
(Michalski & Tecuci, 1994). The bulk of emphasis
in past research in this area has been on systems
which combine analytical and empirical techniques.
Our work, however, is an example of what has been
called “empirical multistrategy learning” (Domingos,
1996). All constituent learners are inductive, each
designed to solve the IE problem individually. Elsewhere
we have shown that heuristic combination of two learners
from different paradigms can yield substantial
performance improvements for the IE problem (Freitag,
1997). Here, we ask how we might profitably combine
component learners by treating them as black
boxes. This approach has been called “meta-learning”
in the literature (Chan & Stolfo, 1993). Although we
might expect a heuristic combination to achieve better
performance, there are clear advantages to the
meta-learning approach. It is modular and flexible, making
no assumptions about the design of component learners
or the number of learners available.

In this paper, we introduce three machine learning
algorithms for IE, each drawn from a different paradigm
and each suitable for particular kinds of IE problems.
Next, we describe three ways of combining the basic
learners, all variations of the meta-learning idea.
Finally, we describe a set of experiments on two IE
domains.


2  LEARNING  TO  EXTRACT 

In the simplest version of the information extraction
problem, a single set of questions is applied to each
document in a domain, and a single text fragment is
sought as the answer to each question. We call a single
question a field; the answer fragment from an individual
document is a field instance or instantiation.
For example, in a domain consisting of newswire articles
describing terrorist attacks, one field might be the
perpetrator of the attack, and the instantiation of this
field in a given article might be “FMLN.”

A field can be formalized as a function $\mathcal{F}(D) = (b_b, b_e)$
that maps a document to the boundaries of a text fragment
($b_b$ and $b_e$ are the indexes of the beginning and
ending boundary terms, respectively). Given a set of
documents in which this mapping is labeled, the goal
of a ML system is to learn the function $\hat{\mathcal{F}}$ that best
approximates $\mathcal{F}$. This can be realized in the form of
an auxiliary function $\mathcal{G}(D, b_b, b_e) = \mathbb{R} \cup \{nil\}$, which,
given a candidate fragment, either returns a confidence
that it is a field instance or declines to issue a confidence
(nil). The form of $\mathcal{G}$ has a convenient affinity
with any number of ML algorithms (the nil in its range
constitutes a failure to match, for algorithms that include
a notion of matching). The three approaches we
will discuss, all based on standard ideas from ML, each
implement $\mathcal{G}$.
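One way to picture this interface is as a scoring callback that an extractor queries for every candidate fragment. The sketch below is an assumption-laden illustration: `extract`, the `max_len` bound, the use of `None` for nil, and the rule of returning the single highest-confidence fragment are choices made here, not details given in the text.

```python
from typing import Callable, List, Optional, Tuple

Confidence = Optional[float]                           # None plays the role of nil
Scorer = Callable[[List[str], int, int], Confidence]   # G(D, b_b, b_e)


def extract(doc: List[str], g: Scorer, max_len: int) -> Optional[Tuple[int, int]]:
    """Approximate F(D): return the inclusive boundaries (b_b, b_e) of the
    fragment with the highest confidence, or None if g declines everywhere."""
    best: Optional[Tuple[int, int]] = None
    best_conf: Optional[float] = None
    for b in range(len(doc)):
        # Consider fragments of 1..max_len tokens starting at token b.
        for e in range(b, min(b + max_len, len(doc))):
            conf = g(doc, b, e)
            if conf is not None and (best_conf is None or conf > best_conf):
                best, best_conf = (b, e), conf
    return best
```

Any learner that can score a fragment, or refuse to, fits behind the `Scorer` type, which is what allows learners from different paradigms to be treated interchangeably.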

Note that this learning task is only a part of the
functionality of a typical participating system at the
Message Understanding Conference (MUC) (Cardie,
1997). What we have called fields correspond to slots
in the MUC setting. A slot is a component of a larger
structure, called a template, which summarizes the
relevant information contained in a document. In addition
to the slot-filling task, which we address here,
the more general MUC problem includes tasks such
as document relevance determination, discourse analysis,
and template merging. Thus, our results are best
regarded as a piece of the larger IE puzzle.

2.1  ROTE  LEARNING 

Perhaps the simplest possible learning approach to
the IE problem is to memorize field instances verbatim.
Presented with a novel document, this memorizing
learner simply matches text fragments against
its “learned” dictionary, saying “field instance” to any
matching fragments and rejecting all others.

As a slightly more sophisticated approach, we can
estimate the probability that the matched fragment is
indeed a field instance. The dictionary learner we
experiment with here, which we call Rote, does exactly
this. Training Rote involves scanning the training corpus
and storing all distinct field instances verbatim
in its dictionary. Dictionary construction is followed
by a second pass through the training corpus. For
each text fragment in its dictionary, Rote counts the
number of times it appears as a field instance (pos)
and the number of times it occurs overall (tot). During
testing, Rote’s confidence in a prediction is the value
(pos + 1)/(tot + 2), i.e., a Laplace estimate that the
matching fragment is genuine.
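The two training passes and the Laplace estimate can be sketched as follows. The class layout, the representation of a document as a token list plus inclusive (begin, end) spans, and the use of `None` for "no match" are assumptions made for this illustration rather than details of the original implementation.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Doc = Tuple[List[str], List[Tuple[int, int]]]  # (tokens, tagged inclusive spans)


class Rote:
    def train(self, docs: Sequence[Doc]) -> None:
        # Pass 1: store every distinct field instance verbatim.
        dictionary = {tuple(toks[b:e + 1]) for toks, spans in docs for b, e in spans}
        lengths = {len(frag) for frag in dictionary}
        # Pass 2: count occurrences as a field instance (pos) and overall (tot).
        self.pos: Counter = Counter()
        self.tot: Counter = Counter()
        for toks, spans in docs:
            tagged = set(spans)
            for n in lengths:
                for b in range(len(toks) - n + 1):
                    frag = tuple(toks[b:b + n])
                    if frag in dictionary:
                        self.tot[frag] += 1
                        if (b, b + n - 1) in tagged:
                            self.pos[frag] += 1

    def confidence(self, fragment: Sequence[str]) -> Optional[float]:
        """Laplace estimate (pos + 1)/(tot + 2), or None if the fragment is unknown."""
        frag = tuple(fragment)
        if frag not in self.tot:
            return None
        return (self.pos[frag] + 1) / (self.tot[frag] + 2)
```

A fragment seen once, always as a field instance, gets confidence 2/3 rather than 1, which reflects how little evidence a single occurrence provides.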

This approach, simple as it is, is nevertheless surprisingly
applicable in a wide variety of domains. Its confidence,
moreover, correlates well with actual probability
of correctness. Because of this, even low-confidence
predictions are potentially useful.

2.2  TERM-SPACE  LEARNING 

It is straightforward to adapt ideas from document
classification to the IE setting. A simple mapping
might transform every field instance into a miniature
“document” and apply “bag-of-words” algorithms directly,
such as Rocchio with TFIDF term weighting or
Naive Bayes. Such an approach could be viewed as a
generalization of Rote.

In contrast with document classification, however,
positive examples in an IE setting always occur embedded
within some larger context. This context is often critical
in disambiguating field instances from other fragments.
Although it is hard to exploit contextual regularities
by memorizing, statistical approaches are well
suited for this.

We base our bag-of-words learner, which we call Bayes,
on the Naive Bayes algorithm, as used in document
classification and elsewhere (originally in (Maron,
1961)). Each fragment of text in a document (of
appropriate size) is regarded as a competing hypothesis.
Given a document, we want to find the most likely
hypothesis (the fragment most likely to be a field
instance). Bayes Rule tells us how to maintain our belief
in a set of disjoint hypotheses ($H_i$) in reaction to
observed data ($D$):

$$\Pr(H_i \mid D) = \frac{\Pr(D \mid H_i)\,\Pr(H_i)}{\sum_{j=1}^{n} \Pr(D \mid H_j)\,\Pr(H_j)}$$

As in Naive Bayes as used elsewhere, the important
terms to estimate are $\Pr(H_i)$ (the prior probability)
and $\Pr(D \mid H_i)$ (the conditional data probability).


We assume a hypothesis takes the form, “the field
instance starts at token s and is k tokens long” (let
$H_{s,k}$ represent such a hypothesis). In other words, a
single hypothesis consists of two parts, position and
length. We can estimate the probability of a particular
position or length from training data. In our
implementation we treat these two estimates as
independent, which is different from the typical Naive
Bayes data independence assumption, but similar in
spirit. Thus, our prior $\Pr(H_{s,k})$ is simply the product
of Pr(position = s) and Pr(length = k).

Bayes’s data likelihood estimate, $\Pr(D \mid H_{s,k})$, is based
on the terms that occur in and around the text fragment
to which $H_{s,k}$ corresponds. This estimate is
formed in a way similar to Naive Bayes for document
classification (a product of individual term estimates),
but with a few modifications for the IE setting. In
particular, a context window parameter w is set prior
to training, and the w tokens on either side of a fragment
are used to form the estimate, in addition to the
in-field tokens. The algorithm is described in greater
detail elsewhere (Freitag, 1997).
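How the factored prior and the likelihood combine under Bayes Rule can be shown with a short sketch. The likelihoods are taken as given here, since their estimation is described elsewhere; the function name and the dictionary-based inputs are hypothetical.

```python
from typing import Dict, Sequence, Tuple

Hypothesis = Tuple[int, int]  # (s, k): starts at token s, is k tokens long


def posterior(hypotheses: Sequence[Hypothesis],
              pos_prob: Dict[int, float],
              len_prob: Dict[int, float],
              likelihood: Dict[Hypothesis, float]) -> Dict[Hypothesis, float]:
    """Pr(H_{s,k} | D) for each hypothesis, using the prior
    Pr(position = s) * Pr(length = k) and a supplied Pr(D | H_{s,k})."""
    joint = {
        (s, k): pos_prob.get(s, 0.0) * len_prob.get(k, 0.0) * likelihood[(s, k)]
        for s, k in hypotheses
    }
    z = sum(joint.values())          # the denominator of Bayes Rule
    if z == 0.0:                     # no hypothesis has any support
        return {h: 0.0 for h in joint}
    return {h: v / z for h, v in joint.items()}
```

Since the denominator is the same for every hypothesis, finding the most likely fragment only requires comparing the unnormalized products; normalization matters when the result is to be read as a confidence.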

2.3  RELATIONAL  LEARNING 

Both  Bayes  and  Rote  are  hobbled  by  their  inability  to 
take  into  account  anything  but  simple  term  frequency 
statistics.  It  may  be  the  case,  however,  that  the  in¬ 
formation  needed  to  perform  information  extraction 
comes  in  other  forms.  More  abstract  clues  may  be 
important,  such  as  linguistic  syntax,  document  lay¬ 
out,  or  simple  orthography.  In  addition,  statistical 
approaches  like  Bayes  work  by  summing  all  available 
evidence,  whereas  in  IE  a  more  fruitful  approach  may 
involve  identifying  simple  patterns  that  serve  to  dis¬ 
tinguish  sub-classes  of  a  field. 

Symbolic  learning  algorithms  from  the  “covering”  fam¬ 
ily  form  hypotheses  that  match  such  data  spaces  well. 
Previous  research  has  shown  the  effectiveness  of  such 
methods  for  the  IE  problem  (Soderland,  1996)  (Califf 
&  Mooney,  1997).  Our  relational  learner,  called  SRV, 
is  a  variant  of  FOIL  (Quinlan,  1990).  Its  example 
space  consists  of  all  text  fragments  from  the  train¬ 
ing  document  collection  as  long  (in  number  of  tokens) 
as  the  smallest  field  instance  in  the  training  corpus 
but  no  longer  than  the  largest.  A  negative  example 
is  any  fragment  that  is  not  tagged  as  a  field  instance. 
Note  that  this  includes  fragments  that  contain,  are 
contained  by,  and  overlap  with  field  instances. 

Induction proceeds as with FOIL: Starting with a null
rule that matches all examples not covered by previously
learned rules, SRV greedily adds predicates using
FOIL’s information gain metric. In addition to
the tagged document collection, SRV takes as input a
set of features to use in conducting search. These
features come in two varieties, simple features, which map
from  an  individual  token  to  an  arbitrary  value  (e.g., 
capitalized?  or  noun?),  and  relational  features,  which 
map  from  a  token  to  another  token  (e.g.,  next-token  or 
subject-verb). 

An  individual  predicate  in  SRV  belongs  to  one  of  a  few 
predefined  types: 

•  length(Relop  N):  The  number  of  tokens  in  a  frag¬ 
ment  is  less  than,  greater  than,  or  equal  to  some 
integer. 

•  some(Var  Path  Feat  Value):  This  is  a  feature- 
value  test  for  some  token  in  the  sequence  (e.g., 
“the  fragment  contains  some  token  that  is  cap¬ 
italized”).  One  argument  to  this  predicate  is  a 
variable.  For  a  rule  to  match  a  text  fragment, 
each  distinct  variable  in  a  rule  (used  in  this  or  ei¬ 
ther  of  the  position  predicates  below)  must  bind 
to  a  distinct  token  in  the  fragment. 

•  every(Feat Value): Every token in a fragment
passes some feature-value test (e.g., “every token
in the fragment is non-numeric”).

•  position(Var  From  Relop  N):  This  constrains  the 
position  of  a  token  bound  by  a  some-predicate  in 
the  current  rule.  The  position  is  specified  relative 
to  the  beginning  or  end  of  the  sequence. 

•  relpos(Var1 Var2 Relop N): This constrains the
ordering and distance between two tokens bound by
distinct variables in the current rule.

Relational features are used only in the Path argument
to the some predicate. This argument can be
empty, in which case the some predicate is asserting a
feature-value test for a token actually occurring within
a field, or it can be a list of relational features. In the
latter case, it is positing both a relationship between a
field token and some other nearby token, as well as
a feature-value for the other token. For example, the
assertion:

some(?A  [prev-token  prev-token]  capitalized  true) 

amounts  to  the  English  statement,  “There  is  some  to¬ 
ken  preceded  by  a  capitalized  token  two  tokens  back.” 
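The way a some-predicate with a non-empty Path is evaluated can be sketched as below. This is a simplified illustration under stated assumptions: the feature tables, the function name `some`, and the example tokens are hypothetical, and variable binding (the requirement that distinct variables bind to distinct tokens) is omitted.

```python
def is_capitalized(token):
    """Simple feature: does the token start with an upper-case letter?"""
    return token[:1].isupper()


# Relational features map a token index to another token index,
# or to None when the path runs off the document.
RELATIONAL = {
    "prev-token": lambda tokens, i: i - 1 if i > 0 else None,
    "next-token": lambda tokens, i: i + 1 if i + 1 < len(tokens) else None,
}

# Simple features map a token to a value.
SIMPLE = {"capitalized": is_capitalized}


def some(tokens, start, length, path, feat, value):
    """True if some token of the fragment, after following `path`,
    reaches a token whose feature `feat` equals `value`."""
    for i in range(start, start + length):
        j = i
        for rel in path:
            j = RELATIONAL[rel](tokens, j)
            if j is None:
                break
        if j is not None and SIMPLE[feat](tokens[j]) == value:
            return True
    return False


tokens = "Wean Hall room 5409 at noon".split()
# Fragment "room 5409": two tokens before "room" is the capitalized "Wean".
matches = some(tokens, 2, 2, ["prev-token", "prev-token"], "capitalized", True)
```

An empty path reduces the test to a feature-value check on the fragment's own tokens, as the text describes.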


Figure 1: Hypothetical Extraction of a Seminar
Location. Each box style is intended to represent a
different learner. By combining evidence from multiple
learners, we can correct for the mistakes of individual
learners.

In  order  to  enable  SRV  to  return  confidences  with  its 
predictions,  training  is  followed  by  a  validation  step. 
Rather  than  train  on  the  entire  training  collection,  we 
set  aside  a  fraction  of  the  documents  (one-third  here) 
for  validation.  With  each  rule  learned  by  SRV  we  store 
its  performance  on  the  hold-out  set.  From  this  perfor¬ 
mance  we  estimate  a  rule’s  actual  accuracy.  The  confi¬ 
dence  of  a  prediction  made  by  SRV  is  formed  from  the 
estimated  accuracy  of  matching  rules.  For  additional 
details  on  SRV,  please  refer  to  (Freitag,  1998). 

3  COMBINING  LEARNERS 

Certain  features  of  the  IE  problem  make  it  particularly 
amenable  to  a  multistrategy  approach.  Among  these 
are the following:

•  Examples  have  multiple  representations. 

Because  documents  and  text  fragments  are  “nat¬ 
ural”  objects  which  must  be  mapped  to  appro¬ 
priate  representations  for  learning,  multiple  map¬ 
pings  are  possible.  Although  some  information  is 
necessarily  lost  in  any  one  mapping,  we  can  hope 
that  taking  multiple  views  of  a  document  will  per¬ 
mit  better  overall  performance. 

•  The  problem  is  essentially  Boolean.  As  out¬ 
lined  above,  performing  extraction  can  be  reduced 
to  the  task  of  accepting  or  rejecting  candidate 
text  fragments.  Consequently,  we  can  gauge  a 
learner’s  performance  on  validation  documents  in 
an  attempt  to  model  the  relationship  between  pre¬ 
diction  confidence  and  probability  of  correctness. 

•  Each document is a case study. In contrast
with a traditional classification problem, each
performance unit, a document, is a collection of test
problems. Overgeneration, the problem of saying
yes to too many text fragments, can be regarded
as an asset when multiple learners are available. It
both affords more data for our attempt to model
a learner’s usefulness, and holds forward the hope
that the poor predictions of a single learner can be
corrected by checking them against those of other
learners.

Figure 1 shows a hypothetical excerpt from a seminar
announcement and how such correction might take
place.

Figure 2: The Basic Combination Scheme. Regression
models based on learner performance on hold-out sets
are used to map raw confidence scores to probabilities.
The combiner uses these probabilities to order
all predictions.

3.1  BASIC  COMBINATION  METHOD 

Within  the  constraint  that  all  learners  assign  a  con¬ 
fidence  to  any  predictions  they  make  (any  fragments 
they  accept),  a  wide  range  of  behaviors  is  possible. 
In  particular,  for  a  number  of  reasons,  we  cannot  as¬ 
sume  that  the  confidences  bear  any  resemblance  to 
true  probability  of  correctness,  or  even  that  they  are 
comparable  across  learners.  Bayes’s  confidences  are 
large  negative  log  probabilities,  for  example. 

We  do  assume,  however,  that  probability  of  correctness 
increases  with  increasing  confidence  for  all  learners. 
The  basic  idea,  therefore,  is  to  attempt  to  compute  a 
mapping  for  each  learner  from  confidence  to  probabil¬ 
ity  of  correctness.  Figure  2  shows  this  in  outline.  The 
specific steps involved are:

1.  Validate  performance  on  a  hold-out  set.  Re¬ 
serve  a  part  of  the  training  set  for  validation. 
After  training  each  learner,  store  its  predictions, 
with  confidences,  on  the  hold-out  set. 


2.  Use  regression  to  map  confidences  to  prob¬ 
abilities.  Based  on  the  learner’s  performance  on 
the  hold-out  set,  attempt  to  model  how  its  perfor¬ 
mance  varies  with  confidence.  What  is  modeled, 
and  the  kind  of  regression  used,  depends  on  the 
combination  method. 

3.  Use  the  regression  models  and  calculated 
probabilities  to  make  the  best  choice  on  the 
test  set. 

We  experimented  with  three  basic  methods  of  combi¬ 
nation.  The  first  two,  which  we  will  call  Max  and  Prob, 
both  attempt  to  work  with  regression  models  that  map 
directly  from  confidence  to  probability  of  correctness. 
The  third,  which  we  will  call  CBayes,  uses  Bayes  Rule 
to  make  combination  decisions. 

3.2  REGRESSION  TO  ESTIMATE 
CORRECTNESS 

If  a  learner’s  confidence  numbers  are  meaningful,  then 
the  probability  that  a  prediction  is  correct  will  increase 
with  increasing  confidence.  We  use  linear  regression 
to  model  the  rate  at  which  this  probability  increases. 
For each prediction made we create a datapoint $(x, y)$,
where $x$ is the prediction confidence, and $y$ is 1 if
the prediction was correct (the corresponding fragment
was a field instance), else 0.

The  result  is  a  line  equation  which  we  use  directly 
to  map  from  learner  confidence  to  probability  of  suc¬ 
cess.  Both  Max  and  Prob  use  the  resulting  estimates 
to  arbitrate  among  multiple  learners’  predictions  for  a 
document.  Estimates  are  computed  for  each  learner’s 
predictions,  and  the  prediction  with  the  highest  esti¬ 
mate  is  chosen  as  the  top  combined  prediction.  The 
two  methods  differ  only  in  how  they  handle  the  case  in 
which  multiple  learners  offer  predictions  for  the  same 
text fragment. In such an event, Max simply takes the
larger estimate as the probability that the fragment is
a field instance.
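The regression step and the Max rule can be sketched as follows. The hold-out data and learner names are hypothetical, and clipping the line's output to [0, 1] is an assumption of this sketch; the text does not say how out-of-range estimates are handled.

```python
def fit_line(points):
    """Ordinary least-squares fit y = a*x + b over (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def to_probability(model, confidence):
    """Map a raw confidence to an estimated probability of correctness."""
    a, b = model
    return min(1.0, max(0.0, a * confidence + b))  # clipping is assumed


# Hypothetical hold-out results per learner: (confidence, 1 if correct else 0).
holdout = {
    "A": [(0.2, 0), (0.4, 0), (0.6, 1), (0.8, 1)],
    "B": [(-30.0, 0), (-20.0, 1), (-10.0, 1), (-5.0, 1)],  # log-probability style
}
models = {name: fit_line(points) for name, points in holdout.items()}


def combine_max(predictions):
    """predictions: list of (learner, fragment, confidence).
    Returns fragment -> largest estimate offered by any learner."""
    best = {}
    for learner, fragment, confidence in predictions:
        p = to_probability(models[learner], confidence)
        best[fragment] = max(best.get(fragment, 0.0), p)
    return best


combined = combine_max([("A", (10, 2), 0.7), ("B", (10, 2), -30.0), ("B", (4, 3), -5.0)])
```

Because each learner gets its own line, confidences on very different scales (here, values in [0, 1] versus negative log probabilities) become comparable after the mapping.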

We believe, however, that the fact that two or more
learners agree on a prediction provides more information
than either prediction alone. Indeed, if we assume
that two probability estimates of an event, $P_A$
and $P_B$, are independent, then the combined probability
is the probability that they are not both wrong, i.e.,
$1 - (1 - P_A)(1 - P_B)$. Prob’s estimate is based on this
assumption. Given a set of probability estimates $P_i$, its
estimate for the combined probability is

$$1 - \prod_{i} (1 - P_i)$$
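A small numeric sketch of the Prob rule, assuming the per-learner estimates have already been mapped to probabilities; the function name `combine_prob` is ours.

```python
def combine_prob(estimates):
    """Probability that not all of the independent estimates are wrong."""
    p_all_wrong = 1.0
    for p in estimates:
        p_all_wrong *= 1.0 - p
    return 1.0 - p_all_wrong


# Two learners agree on a fragment with estimates 0.6 and 0.5.
agreement = combine_prob([0.6, 0.5])   # 1 - 0.4 * 0.5
max_rule = max([0.6, 0.5])             # what Max would report instead
```

With a single estimate the rule returns that estimate unchanged, so Prob and Max differ only when several learners predict the same fragment.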

3.3  BAYESIAN  PREDICTION
COMBINATION

Although  Prob  may  exploit  the  availability  of  predic¬ 
tions  from  multiple  learners  better  than  Max,  it  still 
leaves  something  to  be  desired.  In  particular,  it  ig¬ 
nores  some  of  the  available  information,  such  as  the 
frequency  with  which  a  learner  tends  to  predict  at  a 
given  confidence  level  and  any  notion  of  prior  proba¬ 
bilities. 

For  our  final  combination  method,  we  attempt  to  apply 
Bayes  Rule,  which  tells  us  how  to  maintain  our  prob¬ 
ability  estimates  in  response  to  incoming  data.  Using 
Bayes  Rule  offers  two  advantages  over  Prob:  It  allows 
us  to  incorporate  priors  into  our  estimates,  and  it  tells 
us  how  to  maintain  our  hypothesis  space  so  that  the 
resulting  estimates  are  closer  to  true  probabilities — an 
advantage  in  terms  of  the  accuracy-coverage  trade-off. 

Here, a hypothesis $H_i$ takes the form, “the fragment
at this place in the document is a field instance.” Let
$P_{Ai} = C$ be the event, “Learner A predicted fragment
$i$ is a field instance with confidence $C$.” For each
fragment $i$ chosen by any of the learners, we maintain two
hypotheses explicitly, $H_i$ and $\neg H_i$. Individual learner
predictions $P_{Ai} = C$ are treated as events which cause
us to update hypotheses. We want, therefore, to model
$\Pr(P_{Ai} = C \mid H_i)$ and $\Pr(P_{Ai} = C \mid \neg H_i)$. It is more
convenient, however, to model the event $P_{Ai} \ge C$, i.e.,
the probability of a prediction with confidence at least
$C$. Modeling the cumulative probability yields better
statistics and allows us to avoid the arbitrary decisions
inherent in binning.

We use exponential regression to model these two
probabilities, i.e., we perform linear regression on pairs
of the form $(x, \log(y))$, where $x$ is a confidence level,
and $y$ is the cumulative probability of seeing a
prediction for a fragment given that it either is or is not a
field instance. As an example, consider the problem
of creating the “positive” model $\Pr(P_{Ai} \ge C \mid H_i)$ for
some learner A. Let $F$ be the total number of field
instances in the validation set, and let $G_A(C)$ be the
number of field instances identified by Learner A with
predictions having confidence equal to or greater than
$C$. For every prediction made by Learner A, we add a
regression datapoint $(x, \log(y))$, where $x$ is the
confidence of the prediction and $y = G_A(x)/F$. The
“negative” model $\Pr(P_{Ai} \ge C \mid \neg H_i)$ is constructed in the
same way, except over non-field-instance fragments:
any fragment in the validation set identified by any of
the learners. We settled on exponential regression
empirically, but it is easy to see why it works better than
linear regression. Low-confidence predictions tend to
be more frequent than high-confidence ones, obeying
something like Zipf’s Law.

Table 1: Accuracy-Coverage Results for the Seminar
Announcement Domain.

          speaker               location
          Acc          Cov      Acc          Cov
Rote      57.4 ± 8.8   11.8     89.5 ± 2.2   64.9
Bayes     36.1 ± 3.5   70.8     59.6 ± 2.8   98.7
SRV       60.4 ± 3.0   96.6     75.9 ± 2.6   92.3
Max       59.8 ± 3.0   98.8     75.6 ± 2.5   99.7
Prob      60.8 ± 3.0   98.8     76.0 ± 2.5   99.7
CBayes    62.5 ± 3.0   98.8     75.6 ± 2.5   99.7

          stime                 etime
          Acc          Cov      Acc          Cov
Rote      73.7 ± 2.5   99.6     75.1 ± 3.7   95.4
Bayes     98.2 ± 0.7   100.0    96.1 ± 1.6   99.6
SRV       98.6 ± 0.7   99.8     94.1 ± 2.0   98.4
Max       96.6 ± 1.0   100.0    93.6 ± 2.0   100.0
Prob      99.3 ± 0.5   100.0    95.4 ± 1.7   100.0
CBayes    99.3 ± 0.5   100.0    96.3 ± 1.6   100.0

With  each  prediction,  we  use  the  two  models  associ¬ 
ated  with  a  learner  to  adjust  the  posterior  probabilities 
of  the  two  mutually  exclusive  hypotheses  regarding  the 
affected  fragment,  always  normalizing  so  they  sum  to 
1. 
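The CBayes procedure can be sketched end to end as follows. This is an illustration under stated assumptions rather than the paper's implementation: the validation confidences, the totals `F` and `N`, and the initial prior of 0.1 are hypothetical, and the cumulative models are plugged directly into Bayes Rule as likelihoods.

```python
import math


def fit_exponential(points):
    """Fit y = exp(a*x + b) by least squares on the pairs (x, log y)."""
    logged = [(x, math.log(y)) for x, y in points]
    n = len(logged)
    mean_x = sum(x for x, _ in logged) / n
    mean_v = sum(v for _, v in logged) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in logged)
    sxy = sum((x - mean_x) * (v - mean_v) for x, v in logged)
    a = sxy / sxx
    return a, mean_v - a * mean_x


def cumulative_points(confidences, total):
    """For each prediction confidence x, pair it with
    y = (# predictions with confidence >= x) / total."""
    return [(x, sum(1 for c in confidences if c >= x) / total) for x in confidences]


def update(prior_h, pos_model, neg_model, confidence):
    """One Bayes Rule update of Pr(H_i), normalized over H_i and not-H_i."""
    like_h = math.exp(pos_model[0] * confidence + pos_model[1])
    like_not = math.exp(neg_model[0] * confidence + neg_model[1])
    numerator = like_h * prior_h
    return numerator / (numerator + like_not * (1.0 - prior_h))


# Hypothetical validation data for one learner.
pos_conf = [0.9, 0.8, 0.6, 0.3]  # predictions that were field instances
neg_conf = [0.5, 0.3, 0.2, 0.1]  # predictions that were not
F, N = 5, 20                     # assumed totals of field / non-field fragments

pos_model = fit_exponential(cumulative_points(pos_conf, F))
neg_model = fit_exponential(cumulative_points(neg_conf, N))
posterior = update(0.1, pos_model, neg_model, confidence=0.8)
```

Calling `update` once per learner prediction on the same fragment, feeding each posterior back in as the next prior, gives the running estimate that the combiner uses to order fragments.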

4  EXPERIMENTS 

We  experimented  with  data  from  two  IE  domains.  One 
consists  of  485  postings  to  electronic  bulletin  boards, 
which  describe  upcoming  seminars  in  a  university  en¬ 
vironment.  The  earliest  of  these  announcements  dates 
to  October,  1982;  the  most  recent  was  posted  in  Au¬ 
gust,  1995.  We  manually  tagged  these  announcements 
for  four  fields:  speaker,  location,  stime  (start  time), 
and  etime  (end  time).  The  other  domain  is  a  collec¬ 
tion  of  600  newswire  articles  on  corporate  acquisitions 
from  the  Reuters  data  set  (Lewis,  1992).  We  defined 
nine  fields  for  this  domain  and  manually  annotated  the 
collection  to  identify  all  instances  of  them.  We  selected 
five  of  the  fields  for  these  experiments:  acquired  (the 
official  name  of  the  company  or  resource  that  is  being 
purchased), purchaser, acqabr (the short name for
acquired  used  in  the  body  of  the  article),  purchabr, 
and  dlramt  (the  price  paid). 

The performance numbers we report here are the
result of five-fold experiments in each domain. In each
iteration the datasets were randomly divided into two
partitions of equal size. One partition was used for
training, the other for testing.

Table 2: Accuracy-Coverage Results for the Acquisition
Domain.

          acquired              purchaser
          Acc          Cov      Acc          Cov
Rote      56.1 ± 5.6   20.5     47.5 ± 5.6   22.3
Bayes     22.4 ± 2.2   96.4     41.4 ± 2.6   99.7
SRV       41.1 ± 2.6   96.0     49.7 ± 2.7   97.8
Max       43.4 ± 2.5   99.8     51.4 ± 2.7   99.9
Prob      45.0 ± 2.5   99.8     53.2 ± 2.7   99.9
CBayes    45.8 ± 2.5   99.8     54.7 ± 2.6   99.9

          acqabr                purchabr
          Acc          Cov      Acc          Cov
Rote      31.7 ± 4.2   43.8     24.7 ± 4.1   38.5
Bayes     33.1 ± 2.8   99.7     52.0 ± 2.9   99.9
SRV       45.0 ± 3.0   99.8     54.0 ± 2.9   99.6
Max       42.7 ± 2.9   100.0    57.4 ± 2.9   100.0
Prob      47.6 ± 3.0   100.0    61.0 ± 2.9   100.0
CBayes    47.2 ± 3.0   100.0    60.0 ± 2.9   100.0

          dlramt
          Acc          Cov
Rote      77.2 ± 4.7   48.1
Bayes     62.2 ± 4.3   76.9
SRV       74.4 ± 3.5   90.1
Max       72.0 ± 3.5   95.5
Prob      73.1 ± 3.5   95.5
CBayes    70.2 ± 3.5   95.5

Table 3: F1 Scores. Two scores are shown for each
result: Full, the F1 score for the accuracy-coverage
results reported in Tables 1 and 2, and Peak, the highest
F1 score along the full accuracy-coverage curve.

          speaker        location       stime
          Full   Peak    Full   Peak    Full   Peak
Rote      19.6   --      --     --      84.7   84.7
Bayes     47.8   48.0    --     --      --     --
SRV       74.3   74.3    83.3   83.3    --     --
Max       74.5   74.5    86.0   --      98.3   --
Prob      75.3   75.3    86.3   86.6    99.6   99.6
CBayes    76.6   --      86.0   --      99.6   99.6

          etime          acquired       purchaser
          Full   Peak    Full   Peak    Full   Peak
Rote      84.0   84.0    --     --      30.3   30.3
Bayes     97.8   97.8    --     --      --     59.3
SRV       96.2   96.2    --     --      --     66.4
Max       96.7   96.7    --     61.2    67.9   68.0
Prob      97.6   --      --     62.7    69.5   --
CBayes    98.1   --      --     63.2    --     --

          acqabr         purchabr       dlramt
          Full   Peak    Full   Peak    Full   Peak
Rote      36.8   37.2    30.1   30.8    59.3   59.3
Bayes     49.7   52.8    68.4   68.6    68.8   68.8
SRV       62.0   62.0    70.0   70.2    81.5   81.5
Max       59.8   59.8    72.9   72.9    82.1   82.1
Prob      64.5   64.5    75.8   75.8    82.8   82.8
CBayes    64.1   64.1    75.0   75.0    80.9   80.9

A  third  of  the  training  set,  randomly  selected,  was  set 
aside  for  validation.  Each  learner  was  trained  on  the 
remaining  two-thirds,  and  tested  on  the  validation  set. 
Following  this  validation  step,  each  learner  was  again 
trained  on  the  entire  training  set  and  tested  on  the  test 
set.  The  goal  of  the  combining  methods  was  to  use 
performance  results  on  the  validation  set  to  arbitrate 
among  predictions  on  the  test  set. 

The  performance  of  all  methods  is  summarized  in  Ta¬ 
ble  1,  for  the  seminar  announcement  fields,  and  Ta¬ 
ble  2,  for  the  acquisition  fields.  The  unit  of  measure¬ 
ment  here,  as  elsewhere  in  this  paper,  is  a  document. 
When  assessing  a  learner’s  performance  for  a  single 
document,  we  can  distinguish  among  four  basic  out¬ 
comes:  no  prediction  from  the  learner,  prediction  on  a 
document  lacking  a  field  instance  (spurious),  top  pre¬ 
diction  is  incorrect  (wrong),  and  top  prediction  is  cor¬ 
rect  (correct).  The  coverage  column  (Cov)  shows  for 
what  fraction  of  those  documents  containing  a  field 
instance  a  learner  actually  made  a  prediction.  The 
number  in  the  accuracy  column  (Acc)  shows  the  frac¬ 
tion  of  correct  predictions  over  documents  for  which 
the  learner  made  a  prediction  and  which  contained 
a  field  instance,  i.e.,  it  ignores  spurious  predictions. 
Note  that  if  any  single  learner  makes  a  spurious  pre¬ 
diction,  all  combining  methods  also  make  one,  since 
they  are  limited  to  ordering  the  predictions  made  by 
actual  learners.  Thus,  counting  spurious  predictions 
as  errors,  while  generally  appropriate,  tends  to  obscure 
the  differences  between  the  learners  and  the  combining 
methods. 

Both  the  accuracy  and  coverage  values  should  be 
considered  together.  There  are  cases,  for  example, 
where  the  accuracy  number  makes  Rote  look  like  the 
strongest  extraction  method.  Its  accuracy,  however, 
is  usually  measured  over  a  much  smaller  number  of 
documents.  While  it  can  typically  recognize  a  fraction 
of  field  instances  with  reasonable  accuracy  (especially 
locations),  it  does  not  stand  up  well  to  overall  com¬ 
parison  with  the  other  learners.  For  convenience  in 
comparing  systems,  it  is  common  in  information  re¬ 
trieval  and  information  extraction  to  combine  preci¬ 
sion  and  recall  into  a  single,  summary  number,  called 
the  F-measure: 

$$F = \frac{(\beta^2 + 1.0)\,P R}{\beta^2 P + R}$$

The parameter $\beta$ determines how much to favor recall
over precision. Researchers in information extraction
frequently report the F1 score of a system ($\beta = 1$),
which weights precision and recall equally. We can do
the same with our accuracy-coverage results. Table 3
shows the F1 scores for all learners and fields.

Figure 3: Plots of accuracy vs. coverage for all methods
on two fields, speaker and purchaser.
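As a worked check of the F-measure applied to accuracy-coverage results, the sketch below treats accuracy as precision and coverage as recall, which is how the text uses the formula. The function name is ours; the inputs are SRV's speaker figures from Table 1.

```python
def f_measure(precision, recall, beta=1.0):
    """F-measure; beta = 1 weights precision and recall equally.

    Undefined (division by zero) when both inputs are 0.
    """
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)


# SRV on the speaker field: accuracy 60.4, coverage 96.6 (Table 1).
srv_speaker_f1 = round(f_measure(60.4, 96.6), 1)
```

The result, 74.3, agrees with the Full F1 score listed for SRV on speaker in Table 3.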

For  the  purchabr  field  there  is  clear  statistical  separa¬ 
tion  between  the  best  individual  learner  (SRV)  and  the 
top  two  combining  methods  (Prob  and  CBayes).  Note, 
as  Table  3  makes  clear,  that  even  in  the  cases  where 
the  difference  is  less  apparent,  the  combining  meth¬ 
ods  tend  to  outperform  the  best  individual  method  at 
higher  coverage  levels.  Among  the  three  combining 
methods  there  is  not  one  case  of  statistical  separation, 
but  across  all  fields  a  clear  picture  emerges  in  which 
Prob  and  CBayes  are  better  than  Max.  Note  that  even 
in  cases  where  a  combining  method  performs  only  as 
well  as  the  best  individual  learner,  it  has  served  a  valu¬ 
able  purpose — that  of  relieving  us  of  the  requirement 
of  choosing  a  single  learner.  If  a  combining  method 
can  do  this  in  most  cases,  while  providing  added  value 
in  a  few,  we  account  it  a  clear  success. 

Table 4: Overlap in Learner Behavior for the Speaker
Field. Numbers are the probability that column
learner predicted correctly, given that the row learner
predicted correctly.

          Rote   Bayes  SRV    Max    Prob   CBayes
Rote      1      0.81   0.81   0.90   0.96   0.97
Bayes     0.22   1      0.68   0.86   0.89   0.86
SRV       0.09   0.30   1      0.93   0.93   --
Max       0.10   0.37   0.92   1      0.99   --
Prob      0.11   0.38   0.90   0.98   1      0.98
CBayes    0.11   0.35   0.92   0.93   0.95   1

Perhaps more interesting than summary statistics are
accuracy-coverage (similar to precision-recall) graphs.
Each point x along the horizontal axis represents the
x% most confident predictions. The vertical value at
this point is the accuracy of these predictions. If the
accuracy-coverage curve declines monotonically, it
suggests that the learner’s confidence correlates well with
actual accuracy.
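One way to trace such a curve is sketched below. It assumes one top prediction per document and measures coverage against a hypothetical count of documents containing a field instance; the function name and the toy data are ours.

```python
def accuracy_coverage_curve(predictions, n_documents):
    """predictions: list of (confidence, is_correct), one per document.

    Returns (coverage, accuracy) after each prediction, taking the
    most confident predictions first.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    curve = []
    correct = 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        curve.append((rank / n_documents, correct / rank))
    return curve


# Hypothetical predictions on 5 documents (one document gets no prediction).
curve = accuracy_coverage_curve(
    [(0.9, True), (0.3, True), (0.8, True), (0.5, False)], n_documents=5
)
```

Here accuracy starts at 1.0 and dips once the lower-confidence mistake is admitted, the kind of decline that suggests the confidences are informative.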

Figure  3  shows  the  accuracy-coverage  curves  for  all 
methods  on  two  of  the  fields.  The  speaker  and 
purchaser fields are the ones for which CBayes does
best.  These  graphs  make  clear  what  the  summary 
statistics  cannot:  That  combining  learners  allows  us  to 
make  better  accuracy-coverage  judgments  than  we  can 
with  a  single  learner.  The  anomalous  high-confidence 
behavior  of  Prob  and  Max  in  the  purchaser  curve 
may  be  due  to  an  over-reliance  on  Rote,  which  has 
similar  behavior.  Note  that  the  high-confidence  (low- 
coverage)  end  of  the  curve  is  the  part  with  the  least 
statistical  certainty.  Also,  although  CBayes  appears 
better  than  any  individual  learner,  an  examination  of 
the  graphs  for  all  fields  does  not  support  a  preference 
of  it  over  Prob,  or  vice  versa.  There  are  cases  where 
CBayes  has  high-confidence  difficulties  similar  to  those 
shown  here  for  Prob  and  Max.  We  believe  that  better 
regression  models  will  mitigate  some  of  these  phenom¬ 
ena. 

The strength of a meta-learning approach depends on
the mutual independence of the constituent learners.
Table 4 shows where some of the power of combining
learners comes from on the speaker field, a relatively
challenging task. In this table we ask the question,
given that Learner A has predicted correctly on some
document, what is the probability that Learner B will
also predict correctly? The number in entry (i, j) is the
fraction of all documents correctly handled by method
i which method j also correctly handled. Based on
this table, it is evident that Rote and Bayes are more
closely related to each other than either to SRV.

The column for a combining method allows us to infer
which learners it depends on most for its performance.
It appears from this that all three methods rely more
on Rote than on Bayes. We would hope to see this,
based on Figure 3, since the few Rote predictions that
are available for this field tend to have higher accuracy
than most Bayes predictions. It is also gratifying that
all methods appear to rely heavily on SRV, since it is
the best individual learner in this case.
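The entries of a table like Table 4 can be computed as sketched here, assuming we know, for each method, the set of documents it handled correctly. The document ids and the function name are hypothetical.

```python
def overlap_table(correct_docs):
    """correct_docs: method name -> set of correctly handled document ids.

    Entry [i][j] is the fraction of documents correctly handled by
    method i that method j also handled correctly. Every set is
    assumed to be non-empty.
    """
    return {
        i: {j: len(ci & cj) / len(ci) for j, cj in correct_docs.items()}
        for i, ci in correct_docs.items()
    }


# Hypothetical results: a low-coverage learner and a high-coverage one.
table = overlap_table({
    "Rote": {1, 2},
    "SRV": {1, 2, 3, 4, 5, 6, 7, 8},
})
```

The table is not symmetric: a low-coverage learner can be almost entirely subsumed by a stronger one (row entry near 1) while accounting for only a small share of the stronger learner's successes.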


5  CONCLUSION 

The  experimental  results  presented  here  show  that 
multistrategy  learning  can  be  useful  for  the  problem  of 
information  extraction.  We  present  one  form  of  multi¬ 
strategy  learning,  in  which  the  component  learners  are 
treated  as  black  boxes  and  only  their  reliability,  as  a 
function  of  confidence,  is  modeled.  Nothing  in  the  ba¬ 
sic  framework  requires  the  information  extraction  set¬ 
ting  or  makes  any  assumptions  about  the  number  or 
structure  of  component  learners.  It  is  only  necessary 
that  learners  be  instrumented  to  associate  a  confidence 
with  any  prediction  they  make,  something  which  is  al¬ 
ready  part  of  the  design  of  many  learners,  and  which 
can  be  readily  added  to  others. 

We  do  not  claim  that  the  multistrategy  results  re¬ 
ported  here  are  the  best  that  can  be  achieved.  Many 
details  remain  to  be  filled  in,  such  as  how  best  to  con¬ 
duct  validation  and  which  statistical  assumptions  are 
appropriate.  We  have  experimented  with  two  kinds  of 
regression  to  model  learner  reliability,  but  would  not 
be  surprised  if  other  methods  which  we  have  not  tried, 
such  as  logistic  regression  or  a  simple  neural  network, 
might  afford  increased  accuracy.  We  regard  this  as 
future  work. 

It  also  remains  to  be  seen  how  these  results  might  be 
fit  into  a  more  traditional  information  extraction  set¬ 
ting,  in  which  slot  filling  is  performed  as  part  of  a 
larger  system  and  as  one  of  several  interacting  tasks. 
Still,  the  approaches  described  here  are  immediately 
applicable  to  a  number  of  unconventional  information 
extraction  problems.  And  we  can  begin  to  see  how 
information  extraction  from  ungrammatical  text,  and 
other  “natural”  problems  admitting  multiple  abstract 
representations, can be addressed with machine learning
methods.

Abstract. The problem of combining preferences arises in several
applications, such as combining the results of different search
engines.  This  work  describes  an  efficient  algorithm  for  combin¬ 
ing  multiple  preferences.  We  first  give  a  formal  framework  for  the 
problem.  We  then  describe  and  analyze  a  new  boosting  algorithm 
for  combining  preferences  called  RankBoost.  We  also  describe  an 
efficient  implementation  of  the  algorithm  for  a  restricted  case.  We 
discuss  two  experiments  we  carried  out  to  assess  the  performance 
of  RankBoost.  In  the  first  experiment,  we  used  the  algorithm  to 
combine  different  WWW  search  strategies,  each  of  which  is  a 
query  expansion  for  a  given  domain.  For  this  task,  we  compare 
the  performance  of  RankBoost  to  the  individual  search  strategies. 
The  second  experiment  is  a  collaborative-filtering  task  for  mak¬ 
ing  movie  recommendations.  Here,  we  present  results  comparing 
RankBoost  to  nearest-neighbor  and  regression  algorithms. 


1  Introduction 

Consider  the  following  movie-recommendation  task,  some¬ 
times  called  a  “collaborative-filtering”  problem  [8,  14].  In 
this  task,  a  new  user,  Alice,  seeks  recommendations  of 
movies  that  she  is  likely  to  enjoy.  A  collaborative-filtering 
system  first  asks  Alice  to  rank  movies  that  she  has  already 
seen.  The  system  then  examines  the  rankings  of  movies 
provided  by  other  viewers  and  uses  this  information  to  re¬ 
turn  to  Alice  a  list  of  recommended  movies.  To  do  that,  the 
recommendation  system  looks  for  users  whose  preferences 
are  similar  to  those  of  Alice  and  combines  their  recommen¬ 
dations. 

One  important  property  of  this  problem  is  that  the  most 
relevant  information  to  be  combined  represents  relative 
preferences  rather  than  absolute  ratings.  In  other  words, 
even  if  the  ranking  of  movies  is  expressed  by  assigning 
each  movie  a  numeric  score,  we  would  like  to  ignore  the 
absolute  values  of  these  scores  and  concentrate  only  on 
their  relative  order.  This  distinction  becomes  very  impor¬ 
tant  when  we  combine  the  rankings  of  many  users  who 
often  use  completely  different  ranges  of  scores  to  express 
identical  preferences.  Situations  where  we  need  to  combine 
the  ranking  of  different  models  also  arise  in  meta-searching 
problems [5] and in information-retrieval problems [11, 10].

In this paper, we introduce and analyze an efficient
algorithm called RankBoost for combining multiple rankings.
This algorithm is based on Freund and Schapire’s [6]
AdaBoost  algorithm  and  its  recent  successor  developed  by 
Schapire  and  Singer  [13].  Similar  to  other  boosting  al¬ 
gorithms,  RankBoost  works  by  combining  many  “weak” 
rankings  of  the  given  instances.  Each  of  these  may  be  only 
weakly  correlated  with  the  target  ranking  that  we  are  at¬ 
tempting  to  approximate.  We  show  how  to  combine  such 
weak  rankings  into  a  single  highly  accurate  ranking,  and  we 
prove  a  bound  on  the  quality  of  this  final  ranking  in  terms 
of  the  quality  of  the  weak  rankings. 

For  the  movie  task,  we  use  very  simple  weak  rankings 
which  partition  all  movies  into  only  two  equivalence  sets, 
those which are more preferred and those which are less
preferred.  For  instance,  we  might  use  another  user’s  ranked 
list  of  movies  partitioned  according  to  whether  or  not  he 
prefers  them  to  some  particular  movie  that  appears  on  his 
list.  Such  partitions  of  the  data  have  the  advantage  that  they 
only  depend  on  the  relative  ordering  defined  by  the  given 
rankings  rather  than  absolute  ratings.  Despite  their  appar¬ 
ent  weakness,  their  combination  using  RankBoost  performs 
quite  well  experimentally. 

Besides  giving  a  theoretical  analysis  of  the  quality  of 
the  ranking  produced  by  RankBoost,  we  also  analyze  its 
complexity  and  show  how  it  can  be  implemented  efficiently. 
We discuss further improvements in efficiency which are
possible  in  certain  natural  cases. 

We  report  the  results  of  experimental  tests  of  our  ap¬ 
proach  on  two  different  problems.  The  first  is  the  meta- 
searching  problem.  In  a  meta-search  application,  the  goal 
is  to  combine  the  rankings  of  several  WWW  search  strate¬ 
gies.  Each  search  strategy  is  an  operation  which  takes  as 
input  a  query,  performs  some  simple  transformation  of  the 
query  (such  as  adding  search  directives  such  as  “AND”,  or 
search  tokens  such  as  “homepage”)  and  sends  it  to  a  partic¬ 
ular  search  engine.  The  outcome  of  using  each  strategy  is 
a  list  of  URLs  which  are  proposed  as  answers  to  the  query. 
The  goal  is  to  combine  the  strategies  that  work  best  for  a 
given  set  of  queries. 

The  second  problem  is  the  movie-recommendation  prob¬ 
lem  described  above.  For  this  problem,  there  exists  a 
large  publicly  available  dataset  which  contains  ratings  of 
movies  by  many  different  people.  We  compared  RankBoost 
to  nearest-neighbor  and  regression  algorithms  which  have 
been  previously  studied  for  this  application  using  several 
evaluation measures.

Despite the wide range of applications that use and combine
rankings, this problem has received relatively little attention
in the machine-learning community. The few methods that have
been devised for combining rankings tend to be based either on
nearest-neighbor methods [9, 14] or numerical-optimization
techniques [1, 3]. In the latter case, the rankings are viewed
as real-valued scores and the problem of combining different
rankings reduces to numerical search for a set of parameters
that will minimize the disparity between the combined scores
and the feedback of a user.

While  the  above  (and  other)  approaches  might  work 
well  in  practice,  they  still  do  not  guarantee  that  the  com¬ 
bined  system  will  match  the  user’s  preference  when  we 
view  the  scores  as  a  means  to  express  preferences.  Re¬ 
cently,  Cohen,  Schapire  and  Singer  [4]  proposed  a  frame¬ 
work  for  manipulating  and  combining  multiple  rankings  in 
order  to  directly  minimize  the  number  of  disagreements.  In 
their  framework,  the  rankings  are  used  to  construct  prefer¬ 
ence  graphs  and  the  problem  is  reduced  to  a  combinatorial 
optimization  problem  which  turns  out  to  be  NP-complete; 
hence,  an  approximation  is  used  to  combine  the  different 
rankings.  They  also  describe  an  efficient  on-line  algorithm 
for  a  related  problem. 

The  algorithm  we  present  in  this  paper  uses  a  similar 
framework  to  theirs,  but  sidesteps  the  intractability  prob¬ 
lems.  Furthermore,  RankBoost  is  more  appropriate  for 
batch  settings  where  there  is  “enough”  time  to  find  a  good 
combination.  Thus,  the  two  approaches  complement  each 
other. Together, these algorithms constitute a viable approach
to the problem of combining multiple rankings that, as our
experiments indicate, works very well in practice.

2  A  formal  model  of  the  ranking  problem 

In this section, we describe our formal model for studying
ranking. Let $\mathcal{X}$ be a set called the domain or instance
space. Elements of $\mathcal{X}$ are called instances. For example,
in the movie-ranking task, each movie is an instance.

A  learning  algorithm  in  our  model  accepts  as  input  a 
set of ranking features $f_1, \ldots, f_n$. These are intended to
provide  a  base  level  of  information  about  the  ranking  task. 
Said  differently,  the  learner’s  job  will  be  to  learn  a  ranking 
expressible  in  terms  of  the  ranking  features,  similar  to  or¬ 
dinary  features  in  more  conventional  learning  settings.  For 
the  movie  task,  each  ranking  feature  corresponds  to  a  single 
viewer’s  past  ratings  of  movies. 

Formally, each ranking feature $f_i$ is a function of the
form $f_i : \mathcal{X} \to \overline{\mathbb{R}}$. The set
$\overline{\mathbb{R}}$ consists of all real numbers, plus one
additional element $\phi$ which indicates that no ranking is given
and which is defined to be incomparable to all real numbers. For
two instances $x_0$ and $x_1$, we interpret $f_i(x_1) > f_i(x_0)$
to mean that $x_1$ is ranked higher than $x_0$ by $f_i$. If
$f_i(x) = \phi$ then $x$ is unranked by $f_i$. For the movie
ranking task, $f_i(x)$ is simply the numerical rating provided
by movie-viewer $i$ on movie $x$, or $\phi$ if the movie was not
rated.

The final input to the learning algorithm is a feedback
function $\Phi$. This function encodes known relative ranking
information about a subset of the instances. Typically, the
learner will try to approximate $\Phi$ to produce a ranking of
unseen instances. For the movie task, the feedback consists
of the known movie preferences provided by the current
movie-viewer (i.e., the one for whom the system is currently
attempting to recommend movies).

Formally, we assume the feedback function has the form
$\Phi : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ with the
interpretation that $\Phi(x_0, x_1)$ represents the degree to
which $x_1$ should be ranked above $x_0$. Large positive values
mean that $x_1$ should be ranked above $x_0$ while negative values
mean the opposite; a value of zero indicates no preference between
$x_0$ and $x_1$. Consistent with this interpretation, we assume
that $\Phi(x, x) = 0$ for all $x \in \mathcal{X}$, and that $\Phi$
is anti-symmetric in the sense that
$\Phi(x_0, x_1) = -\Phi(x_1, x_0)$ for all $x_0, x_1 \in \mathcal{X}$.
Note, however, that we do not assume transitivity of the
feedback function.

For the movie task, we can define $\Phi(x_0, x_1)$ to be $+1$ if
movie $x_1$ was preferred to movie $x_0$ by the current viewer,
$-1$ if the opposite was the case, and $0$ if either of the movies
was not seen or if they were equally rated.

We generally assume that the support of $\Phi$ is finite. Let
$\mathcal{X}_\Phi$ denote the set of feedback instances, i.e., those
instances which occur in the support of $\Phi$:

$$\mathcal{X}_\Phi = \{x \in \mathcal{X} \mid \exists x' \in \mathcal{X} : \Phi(x, x') \ne 0\}.$$

Also, let $|\Phi|$ be the size of the support of $\Phi$:

$$|\Phi| = |\{(x_0, x_1) \in \mathcal{X} \times \mathcal{X} \mid \Phi(x_0, x_1) \ne 0\}|.$$
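
The movie-task feedback and its support can be illustrated with a small sketch. The viewer, the ratings, and the function name `feedback_from_ratings` below are hypothetical and chosen only for illustration; zero-valued entries of the feedback function (ties and unrated movies) are simply not stored.

```python
from itertools import permutations

def feedback_from_ratings(ratings):
    """Build Phi(x0, x1) from one viewer's ratings (dict: movie -> score).

    Phi(x0, x1) = +1 if x1 is rated above x0 and -1 if it is rated below.
    Zero entries (ties, unrated movies) are not stored.
    """
    phi = {}
    for x0, x1 in permutations(ratings, 2):
        if ratings[x1] > ratings[x0]:
            phi[(x0, x1)] = +1
        elif ratings[x1] < ratings[x0]:
            phi[(x0, x1)] = -1
    return phi

# Hypothetical viewer: "A" preferred to "B" and "C", which are tied.
ratings = {"A": 5, "B": 3, "C": 3}
phi = feedback_from_ratings(ratings)

feedback_instances = {x for pair in phi for x in pair}  # the set X_Phi
support_size = len(phi)                                 # the quantity |Phi|
```

Storing only the non-zero entries keeps the representation proportional to the size of the support, which is the quantity the later complexity statements are expressed in.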

In some settings, it may be appropriate for the learner to
accept a set of feedback functions $\Phi_1, \ldots, \Phi_m$. However,
all of these can be combined into a single function $\Phi$ simply
by adding them: $\Phi = \sum_j \Phi_j$. (If some have greater
importance than others, then a weighted sum can be used.)

Formally, we require the learner to output a ranking of
all instances represented in the form of a function
$H : \mathcal{X} \to \mathbb{R}$ with a similar interpretation to
that of the ranking features, i.e., $x_1$ is ranked higher than
$x_0$ by $H$ if $H(x_1) > H(x_0)$. For the movie task, this
corresponds to a complete ordering of all movies (with possible
ties allowed).

The  goal  of  the  learner  is  to  produce  a  “good”  ranking 
of  all  instances,  including  those  not  observed  in  training. 
For  instance,  for  the  movie  task,  we  would  like  to  find  a 
ranking  of  all  movies  which  accurately  predicts  which  ones 
a  movie-viewer  will  like  more  or  less  than  others;  obviously, 
this  ranking  should  include  movies  that  the  viewer  has  not 
already  seen.  As  in  other  learning  settings,  how  well  the 
learning  system  performs  on  unseen  data  depends  on  many 
factors,  such  as  the  number  of  instances  covered  in  training 
and  the  representational  complexity  of  the  ranking  produced 
by  the  learner. 

There  are  various  methods  that  can  be  used  to  evaluate 
such  a  ranking.  Some  of  these  are  discussed  in  Section  5. 
The  boosting  algorithm  described  in  the  next  section  at¬ 
tempts  to  minimize  one  possible  measure  called  the  ranking 
loss.

Given: initial distribution $D$ over $\mathcal{X} \times \mathcal{X}$.

Initialize: $D_1 = D$.

For $t = 1, \ldots, T$:

•  Train weak learner using distribution $D_t$.

•  Get weak hypothesis $h_t : \mathcal{X} \to \mathbb{R}$.

•  Choose $\alpha_t \in \mathbb{R}$.

•  Update:

$$D_{t+1}(x_0, x_1) = \frac{D_t(x_0, x_1) \exp(\alpha_t (h_t(x_0) - h_t(x_1)))}{Z_t}$$

where $Z_t$ is a normalization factor (chosen so that $D_{t+1}$ will
be a distribution).

Output the final hypothesis: $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$.

Figure 1: The RankBoost algorithm.

3  A  boosting  algorithm  for  the  ranking  task 

In  this  section,  we  describe  an  approach  to  the  ranking  prob¬ 
lem  based  on  a  machine  learning  method  called  boosting,  in 
particular,  Freund  and  Schapire’s  [6]  AdaBoost  algorithm 
and  its  successor  developed  by  Schapire  and  Singer  [13]. 
Boosting  is  a  method  of  producing  highly  accurate  predic¬ 
tion  rules  by  combining  many  “weak”  rules  which  may  be 
only  moderately  accurate. 

In the current setting, we seek a learning algorithm
which will produce a function $H : \mathcal{X} \to \mathbb{R}$ whose
induced ordering of $\mathcal{X}$ will approximate the relative
orderings encoded by the feedback function $\Phi$. To formalize this
goal, let $D(x_0, x_1) = c \cdot \max\{0, \Phi(x_0, x_1)\}$ so that all
negative entries of $\Phi$ (which carry no additional information)
are set to zero. Here, $c$ is a positive constant chosen so that

$$\sum_{x_0, x_1} D(x_0, x_1) = 1.$$

(When a specific range is not specified on a sum, we always
assume summation over all of $\mathcal{X}$.) A pair $x_0, x_1$ is said
to be crucial if $\Phi(x_0, x_1) > 0$ so that the pair receives
non-zero weight under $D$.

Our boosting algorithm is designed to find an $H$ with a
small weighted number of crucial-pair misorderings, namely,

$$\sum_{x_0, x_1} D(x_0, x_1) [\![H(x_1) \le H(x_0)]\!] = \Pr_{(x_0, x_1) \sim D}[H(x_1) \le H(x_0)]. \quad (1)$$

Here and throughout this paper, we define $[\![\pi]\!]$ to be 1 if
predicate $\pi$ holds and 0 otherwise. We call the quantity in
Eq. (1) the ranking loss and we denote it by $\mathrm{rloss}_D(H)$.
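
As a concrete illustration of the initial distribution and of the ranking loss in Eq. (1), the following minimal sketch normalizes the positive entries of a feedback function and counts misordered crucial pairs, with ties counted as errors. The function names and the toy scores are assumptions made for this example only.

```python
def initial_distribution(phi):
    """D(x0, x1) = c * max(0, Phi(x0, x1)), with c chosen so that D sums to 1."""
    positive = {pair: v for pair, v in phi.items() if v > 0}
    total = sum(positive.values())
    return {pair: v / total for pair, v in positive.items()}

def ranking_loss(D, H):
    """Total weight of crucial pairs (x0, x1) with H(x1) <= H(x0)."""
    return sum(w for (x0, x1), w in D.items() if H[x1] <= H[x0])

# Feedback: "A" should be ranked above both "B" and "C".
phi = {("B", "A"): 1, ("A", "B"): -1, ("C", "A"): 1, ("A", "C"): -1}
D = initial_distribution(phi)

# Hypothetical scores that wrongly place "C" above "A".
H = {"A": 2.0, "B": 1.0, "C": 3.0}
loss = ranking_loss(D, H)
```

Here the two crucial pairs each receive weight 1/2, and only the pair involving "C" is misordered.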

3.1  The  RankBoost  algorithm 

We  call  our  boosting  algorithm  RankBoost,  and  its  pseu¬ 
docode  is  shown  in  Figure  1.  Like  all  boosting  algorithms, 
RankBoost  operates  in  rounds.  We  assume  access  to  a 
separate  procedure  called  the  weak  learner  which,  on  each 
round,  is  called  to  produce  a  weak  hypothesis.  RankBoost 
maintains a distribution $D_t$ over $\mathcal{X} \times \mathcal{X}$ which is passed on
round  t  to  the  weak  learner.  This  distribution  encodes  the 
relative  importance  to  the  weak  learner  that  one  instance  is 
ranked  above  another. 


Weak hypotheses have the form $h_t : \mathcal{X} \to \mathbb{R}$. We think
of  these  as  providing  ranking  information  in  the  manner 
described  above.  The  weak  learner  we  used  in  our  exper¬ 
iments  is  based  on  the  given  ranking  features;  details  are 
given  in  Section  4. 

The boosting algorithm uses the weak hypotheses to
update the distribution as shown in Figure 1. Suppose that
$x_0, x_1$ is a crucial pair so that we want $x_1$ to be ranked higher
than $x_0$ (in all other cases, $D_t$ will be zero). Assuming for
the moment that the parameter $\alpha_t > 0$ (as it usually will be),
this rule has the effect of decreasing the weight $D_t(x_0, x_1)$ if
$h_t$ gives a correct ranking ($h_t(x_1) > h_t(x_0)$) and increasing
the weight otherwise. Thus, $D_t$ will tend to concentrate
on the pairs whose relative ranking is hardest to determine.
The actual setting of $\alpha_t$ will be discussed shortly.

The final or combined hypothesis $H$ is a weighted sum
of  the  weak  hypotheses.  We  can  prove  the  following  bound 
on  the  ranking  loss  of  H.  This  theorem  also  provides  guid¬ 
ance  in  choosing  at  and  in  designing  the  weak  learner  as 
we  discuss  below.  Note  that  this  theorem  only  concerns 
performance  on  the  training  data.  As  in  more  standard  clas¬ 
sification  problems,  the  loss  on  a  separate  test  set  can  also  be 
theoretically  bounded  given  appropriate  assumptions  using 
uniform-convergence  theory  [2,  7,  12,  15]. 

Theorem 1  Assuming the notation of Figure 1, the ranking
loss of $H$ is

$$\mathrm{rloss}_D(H) \le \prod_{t=1}^{T} Z_t.$$

Proof: Unraveling the update rule, we have that

$$D_{T+1}(x_0, x_1) = \frac{D(x_0, x_1) \exp(H(x_0) - H(x_1))}{\prod_t Z_t}.$$

Note that $[\![x \ge 0]\!] \le e^x$ for all real $x$. Therefore, the
ranking loss with respect to initial distribution $D$ is

$$\sum_{x_0, x_1} D(x_0, x_1) [\![H(x_0) \ge H(x_1)]\!]
  \le \sum_{x_0, x_1} D(x_0, x_1) \exp(H(x_0) - H(x_1))
  = \sum_{x_0, x_1} D_{T+1}(x_0, x_1) \prod_t Z_t
  = \prod_t Z_t.$$

This proves the theorem. ■

Note that RankBoost generally requires $O(|\Phi|)$ space
and time per round.
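
The update rule of Figure 1 and the bound of Theorem 1 can be checked numerically on a toy problem. The sketch below is illustrative only: `toy_weak_learner`, the four instances, and the three candidate hypotheses are assumptions made for this example, and the choice of the weight from the quantity r anticipates one of the methods of Section 3.2 (it requires |r| < 1).

```python
import math

def rankboost(D, weak_learner, T):
    """Sketch of Figure 1. D maps crucial pairs (x0, x1) to weights.

    weak_learner(D_t) returns (h, alpha), where h maps instances to scores.
    Returns the list of (alpha, h) pairs and the normalizers Z_t.
    """
    D_t, rounds, Zs = dict(D), [], []
    for _ in range(T):
        h, alpha = weak_learner(D_t)
        unnorm = {(x0, x1): w * math.exp(alpha * (h[x0] - h[x1]))
                  for (x0, x1), w in D_t.items()}
        Z = sum(unnorm.values())
        D_t = {pair: w / Z for pair, w in unnorm.items()}
        rounds.append((alpha, h))
        Zs.append(Z)
    return rounds, Zs

def toy_weak_learner(candidates):
    """Hypothetical weak learner: pick the candidate with the largest |r|."""
    def learn(D_t):
        def r_of(h):
            return sum(w * (h[x1] - h[x0]) for (x0, x1), w in D_t.items())
        h = max(candidates, key=lambda c: abs(r_of(c)))
        r = r_of(h)
        return h, 0.5 * math.log((1 + r) / (1 - r))  # assumes |r| < 1
    return learn

# Toy problem: the target order is d > c > b > a.
D = {("a", "b"): 1 / 3, ("b", "c"): 1 / 3, ("c", "d"): 1 / 3}
candidates = [
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 1, "c": 1, "d": 1},
]
rounds, Zs = rankboost(D, toy_weak_learner(candidates), T=3)

H = {x: sum(alpha * h[x] for alpha, h in rounds) for x in "abcd"}
rloss = sum(w for (x0, x1), w in D.items() if H[x1] <= H[x0])
bound = math.prod(Zs)
```

Each round costs time proportional to the number of stored pairs, matching the per-round requirement stated above.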

3.2  Choosing $\alpha_t$ and criteria for weak learners

Thus, to minimize ranking loss, on each round $t$ we should
choose $\alpha_t$ and construct weak hypotheses $h_t$ in a manner
that tends to minimize

$$Z_t = \sum_{x_0, x_1} D_t(x_0, x_1) \exp(\alpha_t (h_t(x_0) - h_t(x_1))).$$

There are various methods for achieving this end. Here we
sketch three. Let us fix $t$ and drop all $t$ subscripts when
clear from context. (In particular, for the time being, $D$ will
denote $D_t$ rather than an initial distribution.)

First and most generally, for any given weak hypothesis
$h$, it can be shown that $Z$, viewed as a function of $\alpha$, has
a unique minimum which can be found numerically via a
simple binary search (except in trivial degenerate cases).
Details are omitted.
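
Since the details are omitted, the following is only one plausible reading of the numerical search: Z is a sum of exponentials in the weight and hence convex, so its derivative is increasing and a bisection on the sign of the derivative locates the minimum. The bracket `[lo, hi]`, the iteration count, and the function name are assumptions of this sketch, not part of the original description.

```python
import math

def minimize_Z(D, h, lo=-20.0, hi=20.0, iters=100):
    """Bisection on the sign of dZ/d(alpha).

    Assumes the minimizer lies in [lo, hi]; Z is convex in alpha,
    so dZ/d(alpha) is non-decreasing.
    """
    deltas = [(w, h[x0] - h[x1]) for (x0, x1), w in D.items()]

    def dZ(alpha):
        return sum(w * d * math.exp(alpha * d) for w, d in deltas)

    for _ in range(iters):
        mid = (lo + hi) / 2
        if dZ(mid) > 0:
            hi = mid   # minimum lies to the left
        else:
            lo = mid   # minimum lies to the right
    return (lo + hi) / 2

# Toy distribution over three crucial pairs (x0, x1) and a {0,1}-valued h.
D = {("a", "b"): 0.6, ("c", "d"): 0.2, ("e", "f"): 0.2}
h = {"a": 0, "b": 1,   # correctly ordered pair
     "c": 1, "d": 0,   # misordered pair
     "e": 0, "f": 0}   # tie
alpha_star = minimize_Z(D, h)
```

For this toy input the derivative vanishes where exp(2α) = 3, so the search should settle at ½ ln 3.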

The second method of minimizing $Z$ is applicable in
the special case that $h$ has range $\{0, 1\}$. In this case, we can
minimize $Z$ analytically as follows: For $b \in \{-1, 0, +1\}$,
let

$$W_b = \sum_{x_0, x_1} D(x_0, x_1) [\![h(x_0) - h(x_1) = b]\!].$$

Also, abbreviate $W_{+1}$ by $W_+$ and $W_{-1}$ by $W_-$. Then
$Z = W_- e^{-\alpha} + W_0 + W_+ e^{\alpha}$. Using simple calculus, it
can be verified that $Z$ is minimized by setting
$\alpha = \frac{1}{2} \ln(W_- / W_+)$ which yields
$Z = W_0 + 2\sqrt{W_- W_+}$. Thus, if we are using
weak hypotheses with range restricted to $\{0, 1\}$, we should
attempt to find $h$ which tends to minimize this value of $Z$
and we should then set $\alpha$ accordingly.
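
The closed form for {0, 1}-valued hypotheses can be verified directly against the definition of Z. The sketch below uses a hypothetical three-pair distribution and assumes both the correctly ordered and the misordered weight are strictly positive, since otherwise the logarithm is undefined.

```python
import math

def analytic_alpha(D, h):
    """For h with range {0, 1}: return (alpha, Z) computed from W_-, W_0, W_+."""
    W = {-1: 0.0, 0: 0.0, 1: 0.0}
    for (x0, x1), w in D.items():
        W[h[x0] - h[x1]] += w
    alpha = 0.5 * math.log(W[-1] / W[1])   # assumes W_- > 0 and W_+ > 0
    Z = W[0] + 2.0 * math.sqrt(W[-1] * W[1])
    return alpha, Z

D = {("a", "b"): 0.6, ("c", "d"): 0.2, ("e", "f"): 0.2}
h = {"a": 0, "b": 1, "c": 1, "d": 0, "e": 0, "f": 0}
alpha, Z = analytic_alpha(D, h)

# Z evaluated from its definition at the same alpha, for comparison.
Z_direct = sum(w * math.exp(alpha * (h[x0] - h[x1]))
               for (x0, x1), w in D.items())
```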

For weak hypotheses with range [0, 1], we can use a
third method based on an approximation of $Z$. Specifically,
note that

$$e^{\alpha x} \le \left(\frac{1 + x}{2}\right) e^{\alpha} + \left(\frac{1 - x}{2}\right) e^{-\alpha}$$

for all real $\alpha$ and $x \in [-1, +1]$. Thus, we can approximate
$Z$ by

$$Z \le \sum_{x_0, x_1} D(x_0, x_1) \left[ \left(\frac{1 + h(x_0) - h(x_1)}{2}\right) e^{\alpha} + \left(\frac{1 - h(x_0) + h(x_1)}{2}\right) e^{-\alpha} \right]
    = \left(\frac{1 - r}{2}\right) e^{\alpha} + \left(\frac{1 + r}{2}\right) e^{-\alpha} \quad (2)$$

where

$$r = \sum_{x_0, x_1} D(x_0, x_1) (h(x_1) - h(x_0)). \quad (3)$$

The right hand side of Eq. (2) is minimized when

$$\alpha = \frac{1}{2} \ln\left(\frac{1 + r}{1 - r}\right) \quad (4)$$

which, plugging into Eq. (2), yields $Z \le \sqrt{1 - r^2}$. Thus,
to approximately minimize $Z$ using weak hypotheses with
range [0, 1], we can attempt to maximize $|r|$ as defined in
Eq. (3) and then set $\alpha$ as in Eq. (4). This is the method used
in our experiments.

3.3  An efficient implementation for bipartite feedback

In this section, we describe a more efficient implementation
of RankBoost for feedback of a special form. We say that
the feedback function is bipartite if there exist disjoint
subsets $\mathcal{X}_0$ and $\mathcal{X}_1$ of $\mathcal{X}$ such that $\Phi$
ranks all instances in $\mathcal{X}_1$ above all instances in
$\mathcal{X}_0$ and says nothing about any other pairs. That is,
formally, for all $x_0 \in \mathcal{X}_0$ and all $x_1 \in \mathcal{X}_1$
we have that $\Phi(x_0, x_1) = +1$, $\Phi(x_1, x_0) = -1$
and $\Phi$ is zero on all other pairs.

Such feedback arises naturally, for instance, in document
rank-retrieval tasks common in the field of information
retrieval. Here, a set of documents may have been
judged to be relevant or irrelevant, and the goal is to find a
ranking of all documents which will tend to rank all relevant
documents above all irrelevant documents. A feedback
function which encodes these preferences will be bipartite.

If RankBoost is implemented naively as in Section 3.2,
then the space and time-per-round requirements will be
$O(|\mathcal{X}_0| |\mathcal{X}_1|)$. In this section, we show how this
can be improved to $O(|\mathcal{X}_0| + |\mathcal{X}_1|)$. Note that, in
this section, $\mathcal{X}_\Phi = \mathcal{X}_0 \cup \mathcal{X}_1$.

The main idea is to maintain a set of weights $v_t$ over
$\mathcal{X}$ (rather than the two-argument distribution $D_t$), and to
maintain the condition that, on each round,

$$D_t(x_0, x_1) = v_t(x_0) v_t(x_1) \quad (5)$$

for all crucial pairs $x_0, x_1$ (recall that $D_t$ is zero for all other
pairs).

Given: disjoint subsets $\mathcal{X}_0$ and $\mathcal{X}_1$ of $\mathcal{X}$.

Initialize: $v_1(x) = (|\mathcal{X}_0| |\mathcal{X}_1|)^{-1/2}$;
$s(x) = +1$ if $x \in \mathcal{X}_1$, and $s(x) = -1$ if $x \in \mathcal{X}_0$.

For $t = 1, \ldots, T$:

•  Train weak learner using distribution $D_t$ (as defined by
Eq. (5)).

•  Get weak hypothesis $h_t : \mathcal{X} \to \mathbb{R}$.

•  Choose $\alpha_t \in \mathbb{R}$.

•  Update:

$$v_{t+1}(x) = \frac{v_t(x) \exp(-\alpha_t s(x) h_t(x))}{\sqrt{Z_t}}$$

where

$$Z_t = \left( \sum_{x \in \mathcal{X}_0} v_t(x) \exp(\alpha_t h_t(x)) \right) \left( \sum_{x \in \mathcal{X}_1} v_t(x) \exp(-\alpha_t h_t(x)) \right).$$

Output the final hypothesis: $H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$.

Figure 2: A more efficient version of RankBoost for bipartite
feedback.

The  pseudocode  for  this  implementation  is  shown  in 
Figure  2.  Eq.  (5)  can  be  proved  by  induction.  Details 
omitted  for  lack  of  space. 
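
The inductive step behind Eq. (5) can also be checked numerically. The sketch below performs one round of the per-instance update and compares it with the pairwise update of Figure 1. It assumes that the per-instance weights are divided by the square root of the normalizer, which is what Eq. (5) requires for the product of two weights to be divided by the normalizer itself; the instance names, the hypothesis `h`, and the value of `alpha` are hypothetical.

```python
import math

def bipartite_round(v, s, h, alpha):
    """One update in the style of Figure 2; v, s, h are dicts over X_0 and X_1."""
    Z0 = sum(v[x] * math.exp(alpha * h[x]) for x in v if s[x] == -1)
    Z1 = sum(v[x] * math.exp(-alpha * h[x]) for x in v if s[x] == +1)
    Z = Z0 * Z1
    v_next = {x: v[x] * math.exp(-alpha * s[x] * h[x]) / math.sqrt(Z)
              for x in v}
    return v_next, Z

X0, X1 = ["a", "b"], ["c", "d", "e"]
s = {**{x: -1 for x in X0}, **{x: +1 for x in X1}}
v = {x: (len(X0) * len(X1)) ** -0.5 for x in s}
h = {"a": 0, "b": 1, "c": 1, "d": 0, "e": 1}
alpha = 0.4
v_next, Z = bipartite_round(v, s, h, alpha)

# Reference: the pairwise update of Figure 1 on D(x0, x1) = v(x0) v(x1).
D = {(x0, x1): v[x0] * v[x1] for x0 in X0 for x1 in X1}
unnorm = {(x0, x1): w * math.exp(alpha * (h[x0] - h[x1]))
          for (x0, x1), w in D.items()}
Z_pair = sum(unnorm.values())
D_next = {pair: w / Z_pair for pair, w in unnorm.items()}
```

The per-instance version touches each instance once per round, whereas the reference computation enumerates every pair.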

Finally, note that all space requirements and all per-round
computations are $O(|\mathcal{X}_0| + |\mathcal{X}_1|)$, with the possible
exception of the call to the weak learner. However, if we
want the weak learner to maximize $|r|$ as in Eq. (3), then
we also only need to pass $|\mathcal{X}_\Phi|$ weights to the weak learner,
all of which can be computed in linear time. Omitting $t$
subscripts, and defining $s(x)$ as in Figure 2, we can rewrite
$r$ as

$$r = \sum_{x_0, x_1} D(x_0, x_1) (h(x_1) - h(x_0))$$

$$= \sum_{x_0 \in \mathcal{X}_0} \sum_{x_1 \in \mathcal{X}_1} v(x_0) v(x_1) (h(x_1) s(x_1) + h(x_0) s(x_0))$$

$$= \sum_{x_0 \in \mathcal{X}_0} \left( v(x_0) \sum_{x_1 \in \mathcal{X}_1} v(x_1) \right) s(x_0) h(x_0)
  + \sum_{x_1 \in \mathcal{X}_1} \left( v(x_1) \sum_{x_0 \in \mathcal{X}_0} v(x_0) \right) s(x_1) h(x_1)$$

$$= \sum_{x} d(x) s(x) h(x) \quad (6)$$

where $d(x) = v(x) \sum_{x' : s(x) \ne s(x')} v(x')$. All of the weights
$d(x)$ can be computed in linear time by first computing the
sums which appear in this equation for the two possible
cases that $x$ is in $\mathcal{X}_0$ or $\mathcal{X}_1$. Thus, we only need to
pass $|\mathcal{X}_\Phi|$ weights to the weak learner in this case rather
than the full distribution $D_t$ of size $|\mathcal{X}_0| |\mathcal{X}_1|$.
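
Eq. (6) can be illustrated as follows: the two side totals are computed once, after which each weight d(x) costs constant time. The function name and the toy weights are assumptions of this sketch; the pairwise sum is included only as a reference value.

```python
def bipartite_r(v, s, h):
    """r = sum_x d(x) s(x) h(x), with d(x) = v(x) * (total weight of the other side)."""
    total = {-1: 0.0, +1: 0.0}
    for x, w in v.items():
        total[s[x]] += w
    d = {x: v[x] * total[-s[x]] for x in v}
    r = sum(d[x] * s[x] * h[x] for x in v)
    return r, d

X0, X1 = ["a", "b"], ["c", "d", "e"]
s = {**{x: -1 for x in X0}, **{x: +1 for x in X1}}
v = {x: (len(X0) * len(X1)) ** -0.5 for x in s}
h = {"a": 0, "b": 1, "c": 1, "d": 0, "e": 1}

r, d = bipartite_r(v, s, h)

# Reference value from the definition of r over all crucial pairs.
r_pairs = sum(v[x0] * v[x1] * (h[x1] - h[x0]) for x0 in X0 for x1 in X1)
```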


4  Weak  hypotheses  for  ranking 


As  described  in  Section  3,  our  algorithm  RankBoost  re¬ 
quires  access  to  a  weak  learner  to  produce  weak  hypotheses. 
In  this  section,  we  describe  an  efficient  implementation  of 
a  weak  learner  for  ranking. 

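Perhaps the simplest and most obvious weak learner
would find a weak hypothesis $h$ which is equal to one of the
ranking features $f_i$, except on unranked instances. That is,

$$h(x) = \begin{cases} f_i(x) & \text{if } f_i(x) \in \mathbb{R} \\ q_{\mathrm{def}} & \text{if } f_i(x) = \phi \end{cases}$$

for some $q_{\mathrm{def}} \in \mathbb{R}$.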

The  main  problem  with  such  a  weak  learner  is  that  it 
depends  critically  on  the  actual  values  defined  by  the  rank¬ 
ing  features,  rather  than  relying  exclusively  on  the  relative¬ 
ordering  information  which  they  provide.  We  believe  that 
learning  algorithms  of  the  latter  form  will  be  much  more 
general  and  applicable.  Such  methods  can  be  used  even 
when  features  provide  only  an  ordering  of  instances  and 
no  scores  or  other  information  are  available.  Such  meth¬ 
ods  also  side-step  the  issue  of  combining  ranking  features 
whose  associated  scores  have  different  semantics  (such  as 
the  different  scores  assigned  to  URL’s  by  different  search 
engines). 

For these reasons, we focus in this section and in our
experiments on $\{0, 1\}$-valued weak hypotheses which use
the ordering information provided by the ranking features,
but ignore specific scoring information. In particular, we
will use weak hypotheses $h$ of the form

$$h(x) = \begin{cases} 1 & \text{if } f_i(x) > \theta \\ 0 & \text{if } f_i(x) \le \theta \\ q_{\mathrm{def}} & \text{if } f_i(x) = \phi \end{cases} \quad (7)$$

where $\theta \in \mathbb{R}$ and $q_{\mathrm{def}} \in \{0, 1\}$. That is, a weak
hypothesis is derived from a ranking feature $f_i$ by comparing the
score of $f_i$ on a given instance to a threshold $\theta$. To instances
left unranked by $f_i$, the weak hypothesis assigns the default
score $q_{\mathrm{def}}$. For the remainder of this section, we show how
to choose the "best" feature, threshold and default score.


Let us fix $t$ and drop it from all subscripts to simplify
the notation. Since the ranges of our weak hypotheses are
bounded in [0, 1], we can use the third method¹ described in
Section 3.2 to guide us in our search for a weak hypothesis.
Recall that, according to this method, the weak learner
should seek a weak hypothesis which maximizes $|r|$ as
given by Eq. (3). For a given candidate weak hypothesis,
we can compute $r$ directly in $O(|\Phi|)$ time. Moreover, for
each of the $n$ ranking features, there are at most $|\mathcal{X}_\Phi| + 1$
thresholds to consider (as defined by the range of $f_i$ on
$\mathcal{X}_\Phi$) and two possible default scores (0 and 1). Thus,
naively, $|r|$ can be maximized in $O(n |\Phi| |\mathcal{X}_\Phi|)$ time. We
now describe a time and space efficient algorithm for maximizing
$|r|$ which requires only $O(n |\mathcal{X}_\Phi| + |\Phi|)$ time. (In case
of bipartite feedback, if the boosting algorithm of Section 3.3
is used, only $O(n |\mathcal{X}_\Phi|)$ time is needed.)

We begin by rewriting $r$ for a given $D$ and $h$ as follows:

$$r = \sum_{x_0, x_1} D(x_0, x_1) (h(x_1) - h(x_0))$$

$$= \sum_{x_0, x_1} D(x_0, x_1) h(x_1) - \sum_{x_0, x_1} D(x_0, x_1) h(x_0)$$

$$= \sum_{x} \sum_{x'} D(x', x) h(x) - \sum_{x} \sum_{x'} D(x, x') h(x)$$

$$= \sum_{x} h(x) \sum_{x'} (D(x', x) - D(x, x'))$$

$$= \sum_{x} h(x) \pi(x), \quad (8)$$

where we define $\pi(x) = \sum_{x'} (D(x', x) - D(x, x'))$ as the
potential of $x$. Note that $\pi(x)$ depends only on the current
distribution $D$. Hence, the weak learner can precompute
all the potentials at the beginning of each boosting round
in $O(|\Phi|)$ time and $O(|\mathcal{X}_\Phi|)$ space. When the feedback is
bipartite, comparing Eqs. (6) and (8), we see that $\pi(x) =
d(x) s(x)$ where $d$ and $s$ are defined in Section 3.3; thus, in
this case, $\pi$ can be computed even faster in only $O(|\mathcal{X}_\Phi|)$
time.
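
A minimal sketch of the potential computation follows: one pass over the stored pairs accumulates the potential of every feedback instance, after which r for any candidate hypothesis is a single sum over instances, as in Eq. (8). The distribution and the hypothesis are hypothetical, and the weight is set as in Eq. (4).

```python
import math

def potentials(D):
    """pi(x) = sum over x' of (D(x', x) - D(x, x')), in one pass over the pairs."""
    pi = {}
    for (x0, x1), w in D.items():
        pi[x1] = pi.get(x1, 0.0) + w   # x1 occurs as the second argument
        pi[x0] = pi.get(x0, 0.0) - w   # x0 occurs as the first argument
    return pi

D = {("a", "b"): 0.5, ("b", "c"): 0.3, ("a", "c"): 0.2}
h = {"a": 0, "b": 0, "c": 1}

pi = potentials(D)
r = sum(h[x] * pi[x] for x in pi)                # Eq. (8)
alpha = 0.5 * math.log((1 + r) / (1 - r))        # Eq. (4), assumes |r| < 1

# Reference value from the pairwise definition of r in Eq. (3).
r_pairs = sum(w * (h[x1] - h[x0]) for (x0, x1), w in D.items())
```

Because the potentials do not depend on the hypothesis, they can be reused for every candidate considered in the same round.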

Now let us address the problem of finding a good threshold
value $\theta$ and default value $q_{\mathrm{def}}$. We need to scan the
candidate ranking features $f_i$ and evaluate $|r|$ (defined by
Eq. (8)) for each possible choice of $f_i$, $\theta$ and $q_{\mathrm{def}}$. For
$h$ defined by Eq. (7), we have that

$$r = \sum_{x : f_i(x) > \theta} \pi(x) + q_{\mathrm{def}} \sum_{x : f_i(x) = \phi} \pi(x). \quad (9)$$

For a fixed ranking feature $f_i$, let
$\mathcal{X}_{f_i} = \{x \in \mathcal{X}_\Phi \mid f_i(x) \ne \phi\}$ be the set
of feedback instances ranked by $f_i$. We only need to consider
$|\mathcal{X}_{f_i}| + 1$ threshold values, namely,
$\{f_i(x) \mid x \in \mathcal{X}_{f_i}\} \cup \{-\infty\}$ since these define
all possible behaviors on the feedback instances. Moreover, we can
straightforwardly compute the first term of Eq. (9) for all
thresholds in this set in time $O(|\mathcal{X}_{f_i}|)$ simply by scanning
down a presorted list of threshold values and maintaining
the partial sum in the obvious way.

¹ Although the second method could have been used, we chose
to focus on the third method because it is slightly simpler.
Experiments using the second method are in our future plans.

                     Top 1  Top 2  Top 5  Top 10  Top 20  Top 30  Avg Rank
ML Domain
  RankBoost            102    144    173     184     194     202      4.38
  Best (Top 1)         117    137    154     167     177     181      6.80
  Best (Top 10)        112    147    172     179     185     187      5.33
  Best (Top 30)         95    129    159     178     187     191      5.68
University Domain
  RankBoost             95    141    197     215     247     263      7.74
  Best single query    112    144    198     221     238     247      8.17

Table 1: Comparison of the combined hypothesis and individual
search templates.

For each threshold, we also need to evaluate $|r|$ for the
two possible assignments of $q_{\mathrm{def}}$ (0 or 1). To do this, we
simply need to evaluate $\sum_{x : f_i(x) = \phi} \pi(x)$ once. Naively,
this takes $O(|\mathcal{X}_\Phi - \mathcal{X}_{f_i}|)$ time, i.e., linear in the
number of unranked instances. We would prefer all operations to
depend instead on the number of ranked instances since,
in applications such as meta-searching and information
retrieval, each ranking feature may rank only a small fraction
of the instances. To do this, note that $\sum_x \pi(x) = 0$ by
definition of $\pi(x)$. This implies that

$$\sum_{x : f_i(x) = \phi} \pi(x) = - \sum_{x : f_i(x) \ne \phi} \pi(x).$$

The right hand side of this equation can clearly be computed
in $O(|\mathcal{X}_{f_i}|)$ time.

Thus, for a given ranking feature, the total time required
to evaluate $|r|$ for all candidate weak hypotheses is only
linear in the number of instances that are ranked by that
feature.
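
The scan over thresholds and default scores for a single ranking feature can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the function name, the toy potentials, and the use of negative infinity as the extra threshold (so that every ranked instance can receive the value 1) are assumptions. The sort is done inside the function for brevity; with a presorted list the scan itself is linear in the number of ranked instances.

```python
import math
from itertools import groupby

def best_threshold(f, pi):
    """Scan one ranking feature for the weak hypothesis of Eq. (7) maximizing |r|.

    f maps ranked instances to scores (unranked instances are absent);
    pi maps feedback instances to potentials.
    Returns (|r|, r, theta, q_def).
    """
    ranked = sorted((x for x in f if x in pi), key=lambda x: f[x], reverse=True)
    # The potentials sum to zero, so the unranked total follows from the ranked one.
    unranked_sum = -sum(pi[x] for x in ranked)
    groups = [(val, list(g)) for val, g in groupby(ranked, key=lambda x: f[x])]
    thresholds = [val for val, _ in groups] + [-math.inf]
    group_sums = [sum(pi[x] for x in g) for _, g in groups]

    best, partial = None, 0.0   # partial = sum of pi(x) over f(x) > theta
    for k, theta in enumerate(thresholds):
        if k > 0:
            partial += group_sums[k - 1]   # previous score level now exceeds theta
        for q_def in (0, 1):
            r = partial + q_def * unranked_sum
            if best is None or abs(r) > best[0]:
                best = (abs(r), r, theta, q_def)
    return best

# Hypothetical potentials over four feedback instances; "a" is unranked by f.
pi = {"a": -0.6, "b": -0.1, "c": 0.3, "d": 0.4}
f = {"d": 5, "c": 4, "b": 2}
abs_r, r, theta, q_def = best_threshold(f, pi)
```

Repeating this scan for each of the ranking features and keeping the overall maximizer gives the weak hypothesis for the round.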

5  Experimental  evaluation  of  RankBoost 

In  this  section,  we  report  experiments  with  RankBoost  on 
two  ranking  problems.  The  first  is  a  simplified  Web  meta¬ 
search  task,  the  goal  of  which  was  to  build  a  search  strategy 
for  finding  homepages  of  machine-learning  researchers  and 
universities.  The  second  task  is  a  collaborative-filtering 
problem  of  making  movie  recommendations  for  a  new  user 
based  on  the  preferences  of  previous  users. 

In  each  experiment,  we  divided  the  available  data  into 
training  data  and  test  data,  ran  each  algorithm  on  the  training 
data,  and  evaluated  the  output  hypothesis  on  the  test  data. 
Details  are  given  below. 

5.1  Meta-search  task 

We  first  present  experiments  on  learning  to  combine  the 
results  of  several  Web  searches.  This  problem  exhibits 
many  facets  that  require  a  general  approach  such  as  ours. 
For  instance,  approaches  that  learn  to  combine  similarity 
scores  are  not  applicable  since  the  similarity  scores  of  Web 
search  engines  are  often  unavailable. 

In order to test RankBoost on this task, we used the
data of Cohen, Schapire and Singer [4]. Their goal was to
simulate the problem of building a domain-specific search
engine. As test cases, they picked two fairly narrow classes
of queries: retrieving the homepages of machine-learning
researchers (ML), and retrieving the homepages of universities
(UNIV). They chose these test cases partly because the feedback
was readily available from the Web. They obtained a list of
machine-learning researchers, identified by name and affiliated
institution, together with their homepages,² and a similar list
for universities, identified by name and (sometimes) geographical
location, from Yahoo! We refer to each entry on these lists
(i.e., a name-affiliation pair or a name-location pair) as a base
query. The goal is to learn a meta-search strategy which, given a
base query, will generate a ranking of URL's that includes the
correct homepage at or close to the top.

Cohen,  Schapire  and  Singer  also  constructed  a  series  of 
special-purpose  search  templates  for  each  domain.  Each 
template  specifies  a  query  expansion  method  for  converting 
a  base  query  into  a  likely  seeming  AltaVista  query  which  we 
call  the  expanded  query.  For  example,  one  of  the  templates 
has the form +"NAME" +machine +learning which
means  that  AltaVista  should  search  for  all  the  words  in 
the  person’s  name  plus  the  words  ‘machine’  and  ‘learning’. 
When  applied  to  the  base  query  ‘Joe  Researcher  from  Learn¬ 
ing  University’  this  template  expands  to  the  expanded  query 
+"Joe  Researcher"  +machine  +learning. 

A  total  of  16  search  templates  were  used  for  the  ML 
domain  and  22  for  the  UNIV  domain.  Each  search  template 
was  used  to  retrieve  the  top  thirty  ranked  documents.  If 
none  of  these  lists  contained  the  correct  homepage,  then 
the  base  query  was  discarded  from  the  experiment.  In  the 
ML  domain,  there  were  210  base  queries  for  which  at  least 
one  search  template  returned  the  correct  homepage;  for  the 
UNIV  domain  there  were  290  such  base  queries. 

It is instructive to see how this ranking problem can be
mapped into our framework. Formally, the instances now
are all pairs of the form $(q, u)$ where $q$ is a base query and
$u$ is one of the URL's returned by one of the search templates
for this query. Each ranking feature $f_i$ is constructed
from a corresponding search template $i$ by assigning the $j$th
URL $u$ on its list (for base query $q$) a rank of $-j$; that is,
$f_i((q, u)) = -j$. If $u$ was not ranked for this base query,
then we set $f_i((q, u)) = \phi$. We also construct a separate
feedback function $\Phi_q$ for each base query $q$ which ranks
the correct homepage URL $u_*$ above all others. That is,
$\Phi_q((q, u), (q, u_*)) = +1$ and $\Phi_q((q, u_*), (q, u)) = -1$ for
all $u \ne u_*$. All other entries of $\Phi_q$ are set to zero. All
the feedback functions $\Phi_q$ were then combined into one
feedback function $\Phi$ by summing as described in Section 2.
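
The mapping just described can be sketched in a few lines. The data layout (`result_lists[i][q]` as the ranked URL list of template i for base query q), the function names, and the toy URLs are assumptions made for illustration. Because the per-query feedback functions have disjoint supports, summing them amounts to collecting their non-zero entries in one dictionary.

```python
def build_features(result_lists):
    """Return one dict per template mapping (q, u) -> -j (j = 1-based position).

    Pairs that a template did not return are absent, i.e., unranked.
    """
    return [{(q, u): -j
             for q, urls in lists.items()
             for j, u in enumerate(urls, start=1)}
            for lists in result_lists]

def build_feedback(result_lists, correct):
    """Combined feedback: the correct URL of each query is ranked above all others."""
    phi = {}
    for lists in result_lists:
        for q, urls in lists.items():
            for u in urls:
                if u != correct[q]:
                    phi[((q, u), (q, correct[q]))] = +1
                    phi[((q, correct[q]), (q, u))] = -1
    return phi

# Two hypothetical templates and one base query whose correct page is "u2".
result_lists = [{"q1": ["u1", "u2"]}, {"q1": ["u2", "u3"]}]
correct = {"q1": "u2"}

features = build_features(result_lists)
phi = build_feedback(result_lists, correct)
```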

Given  this  mapping  of  the  ranking  problem  into  our 
framework,  we  can  immediately  apply  RankBoost.  This 
mapping  implies  that  each  weak  hypothesis  is  defined  by  a 
search  template  i  (corresponding  to  ranking  feature  /»),  and 
a  threshold  value  9.  Given  a  base  query  g  and  a  URL  u, 
this  weak  hypothesis  outputs  1  or  0  if  u  is  ranked  above  or 
below  the  threshold  9  on  the  list  of  URL’s  returned  by  the 
expanded  query  associated  with  search  template  i  applied  to 
base  query  g.  As  usual,  the  final  hypothesis  ff  is  a  weighted 


^From  ‘http://www.aic.nrl.navy.mil/~aha/research/machine- 
leaming.htmT. 
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Figure  3:  Performance  of  algorithms  with  respect  to  feature  sets  of  sizes  100, 200, 500, 750, 1000, 2000. 


sum of the weak hypotheses. Thus, given a test base query
$q$, we first form all of the expanded queries and send these to
the search engine to obtain lists of URL’s. We then evaluate
$H$ as above on each pair $(q, u)$, where $u$ is a returned URL,
to obtain a predicted ranking of all of the URL’s.
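
As a rough sketch of how such a thresholded weak hypothesis and the weighted final hypothesis might be evaluated: the helper names below are hypothetical, features are assumed to return `None` for unranked instances, the strict comparison against the threshold is an assumption, and the weights are taken as given rather than learned.

```python
def make_weak_hypothesis(feature, theta, q_def=0.0):
    """Thresholded feature: 1 above theta, 0 otherwise, q_def if unranked."""
    def h(instance):
        value = feature(instance)
        if value is None:  # the template did not rank this instance
            return q_def
        return 1.0 if value > theta else 0.0
    return h


def final_hypothesis(weighted_hypotheses):
    """Weighted sum of weak hypotheses, given as (weight, hypothesis) pairs."""
    def H(instance):
        return sum(alpha * h(instance) for alpha, h in weighted_hypotheses)
    return H


def predicted_ranking(H, query, urls):
    """Order the returned URLs by decreasing value of H((query, url))."""
    return sorted(urls, key=lambda u: H((query, u)), reverse=True)
```

Ties in `H` keep the input order here because Python's sort is stable; a real system would need an explicit tie-breaking rule.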

For  evaluation,  we  divided  the  data  into  training  and 
test  sets  using  four-fold  cross-validation.  We  created  four 
partitions  of  the  data,  each  one  using  75%  of  the  base  queries 
for  training  and  25%  for  testing.  Of  course,  the  learning 
algorithms  had  no  access  to  the  test  data  during  training. 
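
A minimal sketch of such a four-fold split over base queries follows. The function name is hypothetical, the queries are assumed to be distinct and hashable, and any shuffling is left to the caller; the original partitioning procedure is not specified beyond the 75%/25% proportions.

```python
def four_fold_partitions(base_queries, folds=4):
    """Return (train, test) pairs; each test fold holds about 1/folds of the queries."""
    queries = list(base_queries)
    partitions = []
    for k in range(folds):
        test = queries[k::folds]          # every folds-th query, starting at k
        held_out = set(test)
        train = [q for q in queries if q not in held_out]
        partitions.append((train, test))
    return partitions
```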

Experimental parameters and evaluation. Since all
search templates had access to the same set of documents, if
a URL was not returned in the top 30 documents by a search
template, we interpreted this as ranking the URL below all
of the returned documents. Thus we set the parameter
$q_{\mathrm{def}}$, the default value for weak hypotheses, to be 0 (see
Section 4).

In  order  to  determine  a  good  number  of  boosting  rounds, 
we  first  ran  RankBoost  on  each  partition  of  the  data  and 
produced  a  graph  of  the  average  training  error  (omitted  due 
to  space  constraints).  On  average,  the  training  error  reached 
zero  after  85  rounds  of  boosting,  so  that  is  the  number 
of  boosting  rounds  that  we  used  in  all  of  the  meta-search 
experiments. 

To evaluate the performance of the individual search
templates in comparison to the combined hypothesis output
by RankBoost, we measured the number of queries for
which the correct document was in the top k ranked
documents, for various values of k. We then compared the
performance of the combined hypothesis to that of the best
search template for each value of k. The results for the
ML and UNIV domains are shown in Table 1. All columns
except the last give the number of base queries for which
the correct homepage was retrieved above rank k. Bold
figures give the maximum value over all of the search
templates on the test data. Note that the best search template is
determined based on its performance on the test data, while
RankBoost only has access to training data.

For the ML data set, the combined hypothesis closely
tracked the performance of the best expert at every value of
k, which is especially interesting since no single template
was the best for all values of k. For the UNIV data set, a
single template was the best^ for all values of k, and the
combined hypothesis performed almost as well as the best
template for k = 1, 2, ..., 10 and then outperformed the
best template for k = 20, 30.

^The  best  query  expansion  heuristic  for  the  UNIV  domain  was 
"NAME"  PLACE. 


We  also  computed  (an  approximation  to)  average  rank, 
i.e.,  the  rank  of  the  correct  homepage  URL,  averaged  over 
all  base  queries  in  the  test  set.  Since  the  correct  URL 
was  sometimes  not  ranked  or  given  a  very  high  rank,  we 
artificially assigned a rank of 31 to every document that was
either  unranked  or  ranked  above  rank  30.  We  also  limited 
the  maximum  rank  in  the  output  generated  by  RankBoost 
to  31  to  compensate  for  the  fact  that  31  was  the  maximum 
rank  that  can  be  assigned  by  any  single  search  template. 
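
The two evaluation quantities described above, the top-k counts and the capped average rank, can be sketched as follows. The function names and the convention that `None` marks an unranked correct URL are assumptions made for illustration.

```python
def top_k_count(correct_ranks, k):
    """Number of queries whose correct URL was ranked within the top k."""
    return sum(1 for r in correct_ranks if r is not None and r <= k)


def capped_average_rank(correct_ranks, cap=31):
    """Average rank, with unranked URLs and ranks beyond cap - 1 counted as cap."""
    capped = [cap if (r is None or r > cap - 1) else r for r in correct_ranks]
    return sum(capped) / len(capped)
```

For example, with correct-URL ranks `[1, 5, None, 45, 30]`, two queries fall in the top 5, and the unranked and rank-45 entries are both counted as 31 when averaging.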

The  last  column  of  Table  1  gives  average  rank.  This 
table  illustrates  the  robustness  of  the  combined  hypothesis 
on  the  ML  domain.  It  outperforms  the  best  template  for 
all  measures  except  top  1,  where  it  differs  from  the  best 
expert  by  12%,  and  top  2,  where  it  differs  by  2%.  On  the 
UNIV  queries,  the  combined  hypothesis  is  almost  always 
competitive  with  the  best  template  for  every  value  of  k, 
with the exception of k = 1, where it trails the best expert
by  15%.  (Nevertheless,  since  this  domain  included  such 
a  good  template,  there  is  little  reason  to  use  something  as 
complicated  as  RankBoost.) 

5.2  Movie  recommendations 

We  also  tested  RankBoost  on  the  movie-recommendations 
task  described  in  the  introduction.  For  our  experiments, 
we used publicly available^ data provided by the Digital
Equipment Corporation, which ran its own EachMovie
recommendation  service  for  eighteen  months  from  March 
1996  to  September  1997  and  collected  user  preference 
data.  Users  were  able  to  assign  a  movie  a  score  from 
the  set  R  =  {0.0, 0.2, 0.4, 0.6, 0.8, 1.0},  1.0  being  the 
best.  We  used  the  data  of  61,625  users  entering  a  total 
of  2,811,983  numeric  ratings  for  1,628  different  movies 
(films  and  videos). 

Most of the mapping of this problem into our framework
was described in Section 2. For our experiments, we
selected a subset C of the users to serve as ranking features:
each user in C defined an ordering of the set of movies
which he or she viewed. We did not set the parameter $q_{\mathrm{def}}$,
allowing the weak learner to choose it adaptively. The feedback
function $\Phi$ was then defined as in Section 2 using the
movie  ratings  of  a  single  target  user.  We  used  half  of  the 
movies  viewed  by  the  selected  target  user  for  the  feedback 
function  in  training,  and  used  the  other  half  of  the  viewed 
movies  for  testing  as  described  below.  We  then  averaged 
all  results  over  many  runs  with  many  different  target  users. 
In  these  experiments,  we  ran  RankBoost  for  100  rounds. 


^From ‘http://www.research.digital.com/SRC/eachmovie/’.
Figure 4: Performance of the algorithms on different feature densities.


We  compared  the  performance  of  RankBoost  on  this 
data  set  to  two  other  algorithms,  a  regression  algorithm  and 
a  nearest-neighbor  algorithm. 

Regression.  We  used  a  regression  algorithm  similar 
to  the  ones  used  by  Hill  and  others  [8].  The  regression 
algorithm  employs  the  assumption  that  the  preferences  of  a 
target  user  Alice  can  be  described  as  a  linear  combination 
of  the  preferences  of  other  users.  Formally,  let  a  be  a  row 
vector  whose  components  are  the  scores  Alice  assigned  to 
movies  (discarding  unranked  movies).  Let  C  be  a  matrix 
containing  the  scores  of  the  other  users  for  the  subset  of 
movies  that  Alice  has  ranked.  Since  some  of  the  users  have 
not  ranked  movies  that  were  ranked  by  Alice,  we  need  to 
decide  on  a  default  rank  for  these  movies.  For  each  user 
represented  by  a  row  in  C,  we  set  the  score  of  the  user’s 
unranked  movies  to  be  the  user’s  average  score  over  all 
movies. We next use linear regression to find a vector $w$
of minimum length which minimizes $\|wC - a\|$. This can
be  done  using  standard  numerical  techniques  (we  used  the 
package  available  in  Matlab).  Given  w  we  can  now  predict 
the  ratings  of  all  the  movies. 
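
For reference, the minimum-length minimizer described above has a standard closed form. The symbol $C^{+}$ for the Moore-Penrose pseudoinverse of $C$ is notation introduced here for illustration; the text does not say which routine was actually used.

```latex
w = a\,C^{+},
\qquad
\|wC - a\| = \min_{v} \|vC - a\|,
\qquad
\|w\| \le \|v\| \ \text{for every minimizer } v .
```

Among all row vectors attaining the least-squares minimum, this choice has the smallest Euclidean norm, which matches the "minimum length" requirement stated above.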

Nearest neighbor. Given a target user Alice with
certain movie preferences, the nearest-neighbor algorithm
(NN) finds a user Bob whose preferences are most similar
to Alice’s and then uses Bob’s preferences to make
recommendations for Alice. More specifically, we find the
ranking feature $f_i$ (corresponding to one of the other movie
viewers) which gives an ordering most similar to that of the
target user as encoded by the feedback function $\Phi$. The
measure of similarity we use is the ranking loss of $f_i$ with
respect to the same initial distribution $D$ which was
constructed by RankBoost. Thus, in some sense, NN can be
viewed as a single weak hypothesis output after one round
of RankBoost (although no thresholding of $f_i$ is performed).

A problem with this algorithm is that the user it selects
may not rank all the movies ordered by the target user. To
fix this, we modified NN to associate with each feature $f_i$ a
default rank $q_{\mathrm{def}} \in R$ which it assigns to unranked movies.
When searching for the best feature, NN chooses $q_{\mathrm{def}}$ by
calculating and then minimizing the ranking loss for each
possible value of $q_{\mathrm{def}}$. If it is the case that this user ranks all
of the movies seen by the target user, then NN sets $q_{\mathrm{def}}$ to
the average rank over all movies that it ranked (including
those not ranked by the target user).

In order to evaluate and compare performance, we used
four different error measures. We assume that the learning
system produces a real-valued function $H$ which orders
instances in the usual way ($x_1$ ranked higher than $x_0$ if
$H(x_1) > H(x_0)$). We compare the ordering of $H$ to a
“correct” ordering $c$ over test instances, also represented
formally  as  a  real-valued  function.  For  simplicity,  we  here 
only  give  definitions  for  these  measures  when  H  defines  a 
total  order  of  all  instances  so  that  no  ties  occur  in  either 
order.  The  definitions  can  be  extended  by  assuming  that 
ties  are  broken  randomly  and  taking  expectations  (details 
omitted  for  lack  of  space). 

All  our  measures  have  range  [0, 1],  with  a  value  0  being 
a  “perfect”  score. 

Disagreement.  Disagreement  is  the  fraction  of  distinct 
pairs  of  instances  which  are  misordered  by  H  (with  respect 
to  c).  If  c  were  used  to  construct  a  feedback  function,  this 
would  be  equivalent  to  the  ranking  loss  of  H. 

Predicted-rank-of-top  (PROT).  This  is  the  minimum 
rank  (according  to  H)  of  any  of  the  truly  top-rated  instances 
(according  to  c).  The  score  is  then  rescaled  to  have  a 
possible  range  of  [0, 1]. 

Coverage.  This  is  the  maximum  rank  (according  to  H) 
of  any  of  the  truly  top-rated  instances  (according  to  c).  The 
score  is  then  rescaled  to  have  a  possible  range  of  [0, 1]. 
(Note that coverage and PROT are equal if there is a unique
top-rated instance according to c.)

Rank-of-predicted-top  (ROPT).  This  is  the  number 
of  instances  ranked  strictly  higher  (according  to  c)  than  the 
predicted  top-rated  instance  (according  to  H).  The  score  is 
then  rescaled  to  have  a  possible  range  of  [0, 1]. 
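
The four measures can be sketched as below for the tie-free case in H. Several details are assumptions made for illustration, since the text does not fix them: ranks are taken as 0-based positions in decreasing order of H, rescaling divides by the number of instances minus one, and at least two instances are required.

```python
from itertools import combinations


def _positions(H, instances):
    """0-based position of each instance when sorted by decreasing H."""
    order = sorted(instances, key=H, reverse=True)
    return {x: pos for pos, x in enumerate(order)}


def _top_rated(c, instances):
    """Instances that receive the highest value under the correct ordering c."""
    best = max(c(x) for x in instances)
    return [x for x in instances if c(x) == best]


def disagreement(H, c, instances):
    """Fraction of distinct pairs that H orders oppositely to c."""
    pairs = list(combinations(instances, 2))
    wrong = sum(1 for x0, x1 in pairs
                if (H(x0) - H(x1)) * (c(x0) - c(x1)) < 0)
    return wrong / len(pairs)


def prot(H, c, instances):
    """Best (minimum) predicted position of any truly top-rated instance."""
    pos = _positions(H, instances)
    return min(pos[x] for x in _top_rated(c, instances)) / (len(instances) - 1)


def coverage(H, c, instances):
    """Worst (maximum) predicted position of any truly top-rated instance."""
    pos = _positions(H, instances)
    return max(pos[x] for x in _top_rated(c, instances)) / (len(instances) - 1)


def ropt(H, c, instances):
    """Fraction of instances that c rates strictly above H's top choice."""
    predicted_top = max(instances, key=H)
    higher = sum(1 for x in instances if c(x) > c(predicted_top))
    return higher / (len(instances) - 1)
```

With these conventions every measure lies in [0, 1] and equals 0 when H orders the instances exactly as c does.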

We  now  describe  our  experimental  results.  We  ran 
a  series  of  three  tests,  examining  the  performance  of  the 
algorithms  as  we  varied  the  number  of  features,  the  density 
of  the  features  (number  of  movies  ranked  by  each  user),  and 
the  density  of  the  feedback. 

We first experimented with the number of features used
for ranking. We selected two disjoint random sets $T$ and $T'$
of 2000 users each. We further divided $T$ into six subsets
$T_1, T_2, \ldots, T_6$ of respective sizes 100, 200, 500, 750, 1000,
2000, such that $T_1 \subset T_2 \subset \cdots \subset T_6$. Each $T_j$ served as
a feature set for training on half of a target user’s movies
and testing on the other half, for each user in $T'$. For
each  algorithm,  we  calculated  the  measures  described  above 
averaged  over  the  2000  test  users.  We  ran  the  algorithms 
on  five  disjoint  random  splits  of  the  data  into  feature  and 
feedback  sets,  and  we  averaged  the  results,  which  are  shown 
in  Figure  3. 

RankBoost was the clear winner for all four performance
measures. The performance of regression was much
poorer,  and  NN  was  in  between.  For  the  most  part,  the 
performance  of  the  algorithms  improved  as  the  number  of 


178  Freund,  Iyer,  Schapire,  and  Singer 


Figure 5: Performance of algorithms on different feedback densities. [Panels: Disagreements, Coverage, Predicted-rank-of-top, Rank-of-predicted-top.]


features  increased.  RankBoost  and  NN  did  reasonably  well 
with  respect  to  disagreement,  which  they  directly  tried  to 
minimize, while regression’s error rate was just slightly better
than 50%. All three algorithms did well on PROT and
ROPT,  although  again  regression  was  worse  (about  30% 
worse  than  RankBoost).  All  three  algorithms  had  difficulty 
with  coverage.  In  all  cases,  RankBoost  was  better  able  to 
use  the  increased  number  of  features. 

We  next  explored  the  effect  of  the  features  and  feedback 
density,  the  number  of  movies  ranked  by  each  user.  We 
partitioned  the  set  of  features  into  bins  according  to  their 
density. The bins were 10-20, 21-40, 41-60, 61-100, 101-1455,
where 1455 was the maximum number of movies
ranked  by  a  single  user  in  the  data  set.  We  selected  a  random 
set  of  1000  features  (users)  from  each  bin  to  be  evaluated 
on  a  disjoint  random  set  of  1000  feedback  users  (of  varying 
densities).  We  ran  the  algorithms  on  six  such  random  splits, 
calculated  the  averages  of  the  four  error  measures  on  each 
split,  and  then  averaged  them  together.  The  results  are 
shown in Figure 4. The x-coordinate of each point is the
average  density  of  the  features  in  a  single  bin;  for  example, 
80  is  the  average  density  of  features  whose  density  is  in  the 
range  61-100.  The  relative  performance  of  the  algorithms 
was  the  same  as  in  Figure  3.  RankBoost  was  again  able 
to  use  the  denser  features  to  obtain  lower  error  rates,  while 
the  improvement  of  NN  was  less  dramatic.  Regression 
actually  performed  the  same  or  worse  as  the  feature  density 
increased. 

We varied the feedback densities in the same way as the
feature densities. We used a random set of 1000 features and
again ran on six random splits, taking averages. The results
appear in Figure 5. As feedback density increased, RankBoost
and NN improved with respect to disagreement and
ROPT, while regression performed worse. All three algorithms
did well on PROT, as might be expected, since larger
feedback  sets  will  likely  have  many  top-ranked  movies. 
For  the  same  reason,  all  three  algorithms  were  very  poor  on 
coverage. 

We  see  from  these  graphs  that  RankBoost  performed  the 
best on this ranking task. RankBoost’s approach of ordering
based on relative comparisons performed much better
than regression, which treats the movie scores as absolute
numerical  values.  RankBoost  also  improved  on  the  nearest- 
neighbor  algorithm  by  combining  multiple  features  to  form 
an  accurate  prediction  rule. 
Abstract

In  a  recent  paper,  Friedman,  Geiger,  and  Goldszmidt  [8] 
introduced  a  classifier  based  on  Bayesian  networks,  called 
Tree  Augmented  Naive  Bayes  (TAN),  that  outperforms 
naive  Bayes  and  performs  competitively  with  C4.5  and 
other  state-of-the-art  methods.  This  classifier  has  several 
advantages including robustness and polynomial computational
complexity. One limitation of the TAN classifier
is that it applies only to discrete attributes, and thus, continuous
attributes must be prediscretized. In this paper,
we  extend  TAN  to  deal  with  continuous  attributes  directly 
via  parametric  (e.g.,  Gaussians)  and  semiparametric  (e.g., 
mixture  of  Gaussians)  conditional  probabilities.  The  result 
is  a  classifier  that  can  represent  and  combine  both  discrete 
and  continuous  attributes.  In  addition,  we  propose  a  new 
method  that  takes  advantage  of  the  modeling  language  of 
Bayesian  networks  in  order  to  represent  attributes  both  in 
discrete  and  continuous  form  simultaneously,  and  use  both 
versions  in  the  classification.  This  automates  the  process 
of  deciding  which  form  of  the  attribute  is  most  relevant 
to  the  classification  task.  It  also  avoids  the  commitment 
to  either  a  discretized  or  a  (semi)parametric  form,  since 
different  attributes  may  correlate  better  with  one  version 
or  the  other.  Our  empirical  results  show  that  this  latter 
method  usually  achieves  classification  performance  that  is 
as  good  as  or  better  than  either  the  purely  discrete  or  the 
purely  continuous  TAN  models. 

1  INTRODUCTION 

The effective handling of continuous attributes is a central
problem in machine learning and pattern recognition.
Almost every real-world domain, including medicine, industrial
control, and finance, involves continuous attributes.
Moreover, these attributes usually have rich interdependencies
with other discrete attributes. Many approaches in
machine learning deal with continuous attributes by discretizing
them. In statistics and pattern recognition, on the
other hand, the typical approach is to use a parametric family
of distributions (e.g. Gaussians) to model the data.

Each of these strategies has its advantages and disadvantages.
By using a specific parametric family, we are making
strong assumptions about the nature of the data. If these
assumptions are warranted, then the induced model can be a
good approximation of the data. In contrast, discretization
procedures are not bound by a specific parametric distribution;
yet they suffer from the obvious loss of information.
Of  course,  one  might  argue  that  for  specific  tasks,  such  as 
classification,  it  suffices  to  estimate  the  probability  that  the 
data  falls  in  a  certain  range,  in  which  case  discretization  is 
an  appropriate  strategy. 

In this paper, we introduce an innovative approach for
dealing with continuous attributes that avoids a commitment
to either one of the strategies outlined above. This
approach uses a dual representation for each continuous
attribute: one discretized, and the other based on fitting
a parametric distribution. We use Bayesian networks to
model the interaction between the discrete and continuous
versions of the attribute. Then, we let the learning procedure
decide which type of representation best models the
training data and what interdependencies between attributes
are appropriate. Thus, if attribute B can be modeled as a
linear Gaussian depending on A, then the network would
have a direct edge from A to B. On the other hand, if the
parametric  family  cannot  fit  the  dependency  of  B  on  A,  then 
the  network  might  use  the  discretized  representation  of  A 
and  B  to  model  this  relation.  Note  that  the  resulting  models 
can  (and  usually  do)  involve  both  parametric  and  discretized 
models  of  interactions  among  attributes. 

In this paper we focus our attention on classification tasks.
We extend a Bayesian network classifier, introduced by
Friedman, Geiger, and Goldszmidt (FGG) [8] called “Tree
Augmented Naive Bayes” (TAN). FGG show that TAN outperforms
naive Bayes, yet at the same time maintains the
computational simplicity (no search involved) and robustness
that characterize naive Bayes. They tested TAN on
problems from the UCI repository [16], and compared it
to C4.5, naive Bayes, and wrapper methods for feature selection
with good results. The original version of TAN
is restricted to multinomial distributions and discrete attributes.
We start by extending the set of distributions that
can be represented in TAN to include Gaussians, mixtures
of Gaussians, and linear models. This extension results in
classifiers that can deal with a combination of discrete and
continuous attributes and model interactions between them.
We compare these classifiers to the original TAN on several
UCI data sets. The results show that neither approach
dominates the other in terms of classification accuracy.

We then augment TAN with the capability of representing
each continuous attribute in both parametric and discretized
forms. We examine the consequences of the dual representation
of such attributes, and characterize conditions under
which the resulting classifier is well defined. Our main hypothesis
is that the resulting classifier will usually achieve
classification performance that is as good as or better than both
the purely discrete and purely continuous TAN models. This
hypothesis is supported by our experiments.

We  note  that  this  dual  representation  capability  also  has 
ramifications in tasks such as density estimation, clustering,
and compression, which we are currently investigating
and  some  of  which  we  discuss  below.  The  extension  of 
the  dual  representation  to  arbitrary  Bayesian  networks,  and 
the  extension  of  the  discretization  approach  introduced  by 
Friedman  and  Goldszmidt  [9]  to  take  the  dual  representation 
into  account,  are  the  subjects  of  current  research. 

2  REVIEW  OF  TAN 

In  this  discussion  we  use  capital  letters  such  as  X,Y,Z 
for  variable  names,  and  lower-case  letters  such  as  x,  y,  z 
to  denote  specific  values  taken  by  those  variables.  Sets 
of  variables  are  denoted  by  boldface  capital  letters  such  as 
X,  Y,  Z,  and  assignments  of  values  to  the  variables  in  these 
sets  are  denoted  by  boldface  lowercase  letters  x,y,z. 

A Bayesian network over a set of variables $\mathbf{X} = \{X_1, \ldots, X_n\}$
is an annotated directed acyclic graph that
encodes a joint probability distribution over $\mathbf{X}$. Formally,
a Bayesian network is a pair $B = \langle G, \mathcal{L} \rangle$. The first component,
$G$, is a directed acyclic graph whose vertices correspond
to the random variables $X_1, \ldots, X_n$, and whose
edges represent direct dependencies between the variables.
The second component of the pair, namely $\mathcal{L}$, represents
a set of local conditional probability distributions (CPDs)
$L_1, \ldots, L_n$, where the CPD for $X_i$ maps possible values $x_i$
of $X_i$ and $\mathrm{pa}(X_i)$ of $\mathrm{Pa}(X_i)$, the set of parents of $X_i$ in $G$,
to the conditional probability (density) of $x_i$ given $\mathrm{pa}(X_i)$.
A Bayesian network $B$ defines a unique joint probability
distribution (density) over $\mathbf{X}$ given by the product

$$P_B(x_1, \ldots, x_n) = \prod_{i=1}^{n} L_i(x_i \mid \mathrm{pa}(X_i)). \quad (1)$$

When the variables in $\mathbf{X}$ take values from finite discrete
sets, we typically represent CPDs as tables that contain parameters
$\theta_{x_i \mid \mathrm{pa}(X_i)}$ for all possible values of $X_i$ and $\mathrm{Pa}(X_i)$.
When the variables are continuous, we can use various parametric
and semiparametric representations for these CPDs.

As an example, let $\mathbf{X} = \{A_1, \ldots, A_n, C\}$, where the
variables $A_1, \ldots, A_n$ are the attributes and $C$ is the class
variable. Consider a graph structure where the class variable
is the root, that is, $\mathrm{Pa}(C) = \emptyset$, and each attribute has the
class variable as its unique parent, namely, $\mathrm{Pa}(A_i) = \{C\}$
for all $1 \le i \le n$. For this type of graph structure, Equation 1
yields $\Pr(A_1, \ldots, A_n, C) = \Pr(C) \cdot \prod_{i=1}^{n} \Pr(A_i \mid C)$.
From the definition of conditional probability, we get
$\Pr(C \mid A_1, \ldots, A_n) = \alpha \cdot \Pr(C) \cdot \prod_{i=1}^{n} \Pr(A_i \mid C)$, where $\alpha$ is
a normalization constant. This is the definition of the naive
Bayesian classifier commonly found in the literature [5].
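
The posterior formula above can be illustrated with a short sketch for discrete attributes. The function name and the nested-dictionary layout of the CPDs are hypothetical, and the probabilities are taken as given rather than estimated from data.

```python
def naive_bayes_posterior(prior, cpds, observation):
    """Posterior over classes for one observation of the attributes.

    prior:       {class: Pr(C = class)}
    cpds:        one entry per attribute, {class: {value: Pr(A_i = value | C = class)}}
    observation: one observed value per attribute, in the same order as cpds
    """
    unnormalized = {}
    for c, p_c in prior.items():
        score = p_c
        for cpd, value in zip(cpds, observation):
            score *= cpd[c][value]
        unnormalized[c] = score
    alpha = 1.0 / sum(unnormalized.values())  # normalization constant
    return {c: alpha * s for c, s in unnormalized.items()}
```

For many attributes the product can underflow, so a practical version would typically sum log-probabilities instead.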

The  naive  Bayesian  classifier  has  been  used  extensively 
for  classification.  It  has  the  attractive  properties  of  being 
robust  and  easy  to  learn — we  only  need  to  estimate  the  CPDs 
$\Pr(C)$ and $\Pr(A_i \mid C)$ for all attributes. Nonetheless, the
naive  Bayesian  classifier  embodies  the  strong  independence 


Figure 1: A TAN model learned for the data set “glass2.” The
dashed lines represent edges required by the naive Bayesian classifier.
The solid lines are the tree augmenting edges representing
correlations between attributes.

assumption that, given the value of the class, attributes are
independent  of  each  other.  FGG  [8]  suggest  the  removal 
of  these  independence  assumptions  by  considering  a  richer 
class  of  networks.  They  define  the  TAN  Bayesian  classifier 
that  learns  a  network  in  which  each  attribute  has  the  class  and 
at  most  one  other  attribute  as  parents.  Thus,  the  dependence 
among  attributes  in  a  TAN  network  will  be  represented  via  a 
tree  structure.  Figure  1  shows  an  example  of  a  TAN  network. 

In a TAN network, an edge from $A_i$ to $A_j$ implies that the
influence of $A_i$ on the assessment of the class also depends
on the value of $A_j$. For example, in Figure 1, the influence
of the attribute “Iron” on the class $C$ depends on the value of
“Aluminum,” while in the naive Bayesian classifier the influence
of each attribute on the class is independent of other
attributes. These edges affect the classification process in
that a value of “Iron” that is typically surprising (i.e., $P(i \mid c)$
is low) may be unsurprising if the value of its correlated
attribute, “Aluminum,” is also unlikely (i.e., $P(i \mid c, a)$ is
high). In this situation, the naive Bayesian classifier will
overpenalize the probability of the class by considering two
unlikely observations, while the TAN network of Figure 1
will not do so, and thus will achieve better accuracy.

TAN networks have the attractive property of being learnable
in polynomial time. FGG pose the learning problem as
a search for the TAN network that has the highest likelihood
$LL(B : D) = \log P_B(D)$, given the data $D$. Roughly speaking,
networks with higher likelihood match the data better. FGG
describe a procedure Construct-TAN for learning TAN
models and show the following theorem.

Theorem 2.1: [8] Let $D$ be a collection of $N$ instances of
$C, A_1, \ldots, A_n$. The procedure Construct-TAN builds a
TAN network $B$ that maximizes $LL(B : D)$ and has time
complexity $O(n^2 \cdot N)$.

The  TAN  classifier  is  related  to  the  classifier  introduced 
by  Chow  and  Liu  [2].  That  method  learns  a  different  tree 
for each class value. FGG’s results show that TAN and
Chow and Liu’s classifier perform roughly the same. In
domains where there are substantial differences in the interactions
between attributes for different class values, Chow
and Liu’s method performs better. In others, it is possible
to learn a better tree by pooling the examples from different
classes as done by TAN. Although we focus on extending
the TAN classifier here, all of our ideas easily apply to classifiers
that learn a different tree for each class value.

3  GAUSSIAN  TAN 

The  TAN  classifier,  as  described  by  FGG,  applies  only  to 
discrete  attributes.  In  experiments  run  on  data  sets  with 
continuous attributes, FGG use the prediscretization described
by Fayyad and Irani [7] before learning a classifier. In
this  paper,  we  attempt  to  model  the  continuous  attributes 
directly  within  the  TAN  network.  To  do  so,  we  need  to  learn 
CPDs  for  continuous  attributes.  In  this  section,  we  discuss 
Gaussian  distributions  for  such  CPDs.  The  theory  of  training 
such  representations  is  standard  (see,  for  example,  [1,  5]). 
We  only  review  the  indispensable  concepts. 

A  more  interesting  issue  pertains  to  the  structure  of  the 
network. As we shall see, when we mix discrete and continuous
attributes, the algorithms must induce directed trees.
This  is  in  contrast  to  the  procedure  of  FGG,  which  learns 
undirected  trees  and  then  arbitrarily  chooses  a  root  to  define 
edge  directions.  We  describe  the  procedure  for  inducing 
directed  trees  next. 

3.1  THE  BASIC  PROCEDURE 

We now extend the TAN algorithm for directed trees. This
extension is fairly straightforward and similar ideas have
been suggested for learning tree-like Bayesian networks
[12]. For completeness, and to facilitate later extensions,
we rederive the procedure from basic principles. Assume
that we are given a data set $D$ that consists of $N$ identically
and independently distributed (i.i.d.) instances that assign
values to $A_1, \ldots, A_n$ and $C$. Also assume that we have
specified the class of CPDs that we are willing to consider.
The objective is, as before, to build a network that maximizes
the likelihood function $LL(B : D) = \log P_B(D)$.

Using Eq. (1) and the independence of training instances,
it is easy to show that

$$LL(B : D) = \sum_i \sum_{j=1}^{N} \log L_i(x_i^j \mid \mathrm{pa}(X_i)^j)
= \sum_i S(X_i \mid \mathrm{Pa}(X_i) : L_i), \quad (2)$$

where $x_i^j$ and $\mathrm{pa}(X_i)^j$ are the values of $X_i$ and $\mathrm{Pa}(X_i)$ in the
$j$th instance in $D$. We denote by $S(X_i \mid \mathrm{Pa}(X_i))$ the value
attained by $S(X_i \mid \mathrm{Pa}(X_i) : L_i)$ when $L_i$ is the optimal
CPD for this family, given the data, and the set of CPDs
we are willing to consider (e.g., all tables, or all Gaussian
distributions). “Optimal” should be understood in terms of
maximizing the likelihood function in Eq. (2).

We now recast this decomposition in the special class
of TAN networks. Recall that in order to induce a TAN
network, we need to choose for each attribute $A_i$ at most
one parent other than the class $C$. We represent this selection
by a function $\pi(i)$, s.t., if $\pi(i) = 0$, then $C$ is the only parent
of $A_i$, otherwise both $A_{\pi(i)}$ and $C$ are the parents of $A_i$. We
define $LL(\pi : D)$ to be the likelihood of the TAN model
specified by $\pi$, where we select an optimal CPD for each
parent set specified by $\pi$. Rewriting Eq. (2), we get

$$\begin{aligned}
LL(\pi : D) &= \sum_{i, \pi(i) > 0} S(A_i \mid C, A_{\pi(i)}) + \sum_{i, \pi(i) = 0} S(A_i \mid C) + S(C \mid \emptyset) \\
&= \sum_{i, \pi(i) > 0} \bigl( S(A_i \mid C, A_{\pi(i)}) - S(A_i \mid C) \bigr) + \sum_i S(A_i \mid C) + S(C \mid \emptyset) \\
&= \sum_{i, \pi(i) > 0} \bigl( S(A_i \mid C, A_{\pi(i)}) - S(A_i \mid C) \bigr) + c,
\end{aligned}$$

where $c$ is some constant that does not depend on $\pi$. Thus,
we need to maximize only the first term. This maximization

can be reduced to a graph-theoretic maximization by the
following procedure, which we call Directed-TAN:

1. Initialize an empty graph $G$ with $n$ vertices labeled
$1, \ldots, n$.

2. For each attribute $A_i$, find the best scoring CPD for
$P(A_i \mid C)$ and compute $S(A_i \mid C)$. For each $A_j$ with
$j \neq i$, if an arc from $A_j$ to $A_i$ is legal, then find the
best CPD for $P(A_i \mid C, A_j)$, compute $S(A_i \mid C, A_j)$,
and add to $G$ an arc $j \to i$ with weight
$S(A_i \mid C, A_j) - S(A_i \mid C)$.

3. Find a set of arcs $A$ that is a maximal weighted branching
in $G$. A branching is a set of edges that has at most one
member pointing into each vertex and does not contain
cycles. Finding a maximally weighted branching is a
standard graph-theoretic problem that can be solved in
low-order polynomial time [6, 17].

4. Construct the TAN model that contains an arc from $C$ to
each $A_i$, and an arc from $A_j$ to $A_i$ if $j \to i$ is in $A$. For each
$A_i$, assign it the best CPD found in step 2 that matches
the choice of arcs in the branching.

From  the  arguments  we  discussed  above  it  is  easy  to  see  that 
this  procedure  constructs  the  TAN  model  with  the  highest 
score.  We  note  that  since  we  are  considering  directed  edges, 
the  resulting  TAN  model  might  be  a  forest  of  directed  trees 
instead  of  a  spanning  tree. 
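The graph-theoretic part of Directed-TAN (steps 1 through 3) can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function names and the toy arc weights are hypothetical, and the exhaustive search below stands in for the polynomial-time branching algorithms of [6, 17], so it is usable only for a handful of attributes.

```python
from itertools import product


def has_cycle(parent):
    """parent[i] is the chosen parent of vertex i, or None if it has none."""
    for start in range(len(parent)):
        seen = set()
        v = start
        while v is not None:
            if v in seen:
                return True
            seen.add(v)
            v = parent[v]
    return False


def max_branching_bruteforce(n, weight):
    """Exhaustive search for a maximum weighted branching.

    weight maps (j, i) to the weight of the arc j -> i, e.g.
    S(A_i | C, A_j) - S(A_i | C). Exponential in n: illustration only.
    """
    # Each vertex picks either no parent or one of its legal incoming arcs.
    options = [[None] + [j for j in range(n) if (j, i) in weight]
               for i in range(n)]
    best_score, best_parent = 0.0, [None] * n
    for parent in product(*options):
        if has_cycle(parent):
            continue
        score = sum(weight[(p, i)] for i, p in enumerate(parent)
                    if p is not None)
        if score > best_score:
            best_score, best_parent = score, list(parent)
    return best_score, best_parent


# Toy arc weights for three attributes (made-up numbers).
weights = {(0, 1): 2.0, (1, 0): 3.0, (1, 2): 1.0, (2, 1): 0.5, (0, 2): 0.2}
score, parent = max_branching_bruteforce(3, weights)
```

In this toy graph the search keeps the arcs 1 -> 0 and 1 -> 2 and leaves vertex 1 without an attribute parent, so in the resulting TAN model the class would be its only parent.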


Theorem 3.1: The procedure Directed-TAN constructs
a TAN network $B$ that maximizes $LL(B : D)$ given the
constraints on the CPDs in polynomial time.

In the next sections we describe how to compute the optimal
$S$ for different choices of CPDs that apply to different
types of attributes.


3.2  DISCRETE  ATTRIBUTES 

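Recall that if $A_i$ is discrete, then we model $P(A_i \mid \mathrm{Pa}(A_i))$
by using tables that contain a parameter $\theta_{a_i \mid \mathrm{pa}(A_i)}$ for each
choice of values for $A_i$ and its parents. Thus,

$$S(A_i \mid \mathrm{Pa}(A_i)) = \sum_j \log P(a_i^j \mid \mathrm{pa}(A_i)^j) = N \sum_{a_i,\,\mathrm{pa}(A_i)} \hat{P}(a_i, \mathrm{pa}(A_i)) \log \theta_{a_i \mid \mathrm{pa}(A_i)},$$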

where $\hat{P}(\cdot)$ is the empirical frequency of events in the training
data. Standard arguments show that the maximum likelihood
choice of parameters is $\theta_{x \mid y} = \hat{P}(x \mid y)$.
Making the appropriate substitution above, we get a nice
information-theoretic interpretation of the weight of the arc
from $A_j$ to $A_i$: $S(A_i \mid C, A_j) - S(A_i \mid C) = N \cdot I(A_i; A_j \mid C)$.
The $I(\cdot)$ term is the conditional mutual information between
$A_i$ and $A_j$ given $C$ [3]. Roughly speaking, it
measures how much information $A_j$ provides about $A_i$ if
we already know the value of $C$. In this case, our procedure
reduces to Construct-TAN of FGG, except that they use
$I(A_i; A_j \mid C)$ directly as the weight on the arcs, while we
multiply these weights by $N$.
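As an illustration of the discrete case, the arc weight can be computed directly from co-occurrence counts. The following Python sketch is an assumption-laden example rather than the implementation used in the experiments: the function name and the toy data are hypothetical, and it uses the empirical (unsmoothed) frequencies.

```python
from collections import Counter
from math import log


def cond_mutual_info(xs, ys, cs):
    """Empirical conditional mutual information I(X; Y | C), in nats.

    xs, ys, cs are equal-length sequences holding the values of two
    discrete attributes and of the class in each training instance.
    """
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), k in n_xyc.items():
        # P(x, y | c) / (P(x | c) P(y | c)) expressed through raw counts.
        ratio = k * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (k / n) * log(ratio)
    return total


# Hypothetical data: within the single class, Y is a copy of X.
xs = [0, 0, 1, 1]
ys = [0, 0, 1, 1]
cs = [0, 0, 0, 0]
arc_weight = len(cs) * cond_mutual_info(xs, ys, cs)  # N * I(X; Y | C)
```

When the two attributes are independent given the class, every ratio equals 1 and the weight is 0, so the arc contributes nothing to the branching.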

3.3  CONTINUOUS  ATTRIBUTES 

We now consider the case where $X$ is continuous. There
are many possible parametric models for continuous variables.
Perhaps the easiest one to use is the Gaussian
distribution. A continuous variable $X$ is Gaussian with
mean $\mu$ and variance $\sigma^2$ if the pdf of $X$ has the form

$$\varphi(x : \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$

If all the parents of a continuous $A_i$ are discrete, then we learn a conditional
Gaussian CPD [11, 15] by assigning to $A_i$ a different mean
$\mu_{A_i \mid \mathrm{pa}(A_i)}$ and variance $\sigma^2_{A_i \mid \mathrm{pa}(A_i)}$ for each joint value of
its parents. Standard arguments (e.g., see [1]) show that we
can rewrite $S(A_i \mid \mathrm{Pa}(A_i))$ as a function of $E[A_i \mid \mathrm{pa}(A_i)]$
and $E[A_i^2 \mid \mathrm{pa}(A_i)]$, the expectations of $A_i$ and $A_i^2$ in
those instances of the data where $\mathrm{Pa}(A_i)$ takes a particular
value. Standard arguments also show that we maximize the
likelihood score of the CPD by choosing

$$\begin{aligned}
\mu_{A_i \mid \mathrm{pa}(A_i)} &= E[A_i \mid \mathrm{pa}(A_i)] \\
\sigma^2_{A_i \mid \mathrm{pa}(A_i)} &= E[A_i^2 \mid \mathrm{pa}(A_i)] - E^2[A_i \mid \mathrm{pa}(A_i)].
\end{aligned}$$

When we learn TAN models in domains with many continuous
attributes, we also want to have families where one
continuous attribute is a parent of another continuous attribute.
In the Gaussian model, we can represent such
CPDs by using a linear Gaussian relation. In this case,
the mean of $A_i$ depends, in a linear fashion, on the value
of $A_j$. This relationship is parameterized by three parameters,
$\alpha_{A_i \mid A_j, c}$, $\beta_{A_i \mid A_j, c}$, and $\sigma^2_{A_i \mid A_j, c}$, for each value $c$ of the
class variable. The conditional probability for this CPD is a
Gaussian with mean $\alpha_{A_i \mid A_j, c} + A_j \beta_{A_i \mid A_j, c}$ and variance
$\sigma^2_{A_i \mid A_j, c}$. Again, by using standard arguments, it is easy to
show that $S(A_i \mid A_j, C)$ is a function of low-order statistics
in the data, and that the maximal likelihood parameters are

$$\begin{aligned}
\beta_{A_i \mid A_j, c} &= \frac{E[A_i A_j \mid c] - E[A_i \mid c]\,E[A_j \mid c]}{E[A_j^2 \mid c] - E^2[A_j \mid c]} \\
\alpha_{A_i \mid A_j, c} &= E[A_i \mid c] - \beta_{A_i \mid A_j, c}\,E[A_j \mid c] \\
\sigma^2_{A_i \mid A_j, c} &= E[A_i^2 \mid c] - E^2[A_i \mid c] - \frac{\bigl( E[A_i A_j \mid c] - E[A_i \mid c]\,E[A_j \mid c] \bigr)^2}{E[A_j^2 \mid c] - E^2[A_j \mid c]}
\end{aligned}$$

In summary, to estimate parameters and to evaluate the
likelihood, we need only to collect the statistics of each
pair of attributes with the class, that is, terms of the form
$E[A_i \mid a_j, c]$ and $E[A_i A_j \mid c]$. Thus, learning in the case of
continuous Gaussian attributes can be done efficiently in a
single pass over the data.
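To make the role of the low-order statistics concrete, here is a minimal Python sketch that estimates the linear Gaussian parameters for one class value from first and second moments only. The function name and the toy data are hypothetical; the sketch assumes the parent attribute is not constant within the class, so that its variance is nonzero.

```python
def linear_gaussian_mle(pairs):
    """Maximum likelihood (alpha, beta, sigma2) for A_i given A_j.

    pairs holds (a_i, a_j) values from the instances of one class c.
    Only the moments E[A_i], E[A_j], E[A_i^2], E[A_j^2], E[A_i A_j]
    are needed, so they could be accumulated in a single pass.
    """
    n = len(pairs)
    e_i = sum(a for a, _ in pairs) / n
    e_j = sum(b for _, b in pairs) / n
    e_ii = sum(a * a for a, _ in pairs) / n
    e_jj = sum(b * b for _, b in pairs) / n
    e_ij = sum(a * b for a, b in pairs) / n

    cov = e_ij - e_i * e_j        # empirical covariance of A_i and A_j
    var_j = e_jj - e_j ** 2       # empirical variance of A_j (assumed > 0)
    beta = cov / var_j
    alpha = e_i - beta * e_j
    sigma2 = e_ii - e_i ** 2 - cov ** 2 / var_j
    return alpha, beta, sigma2


# Toy data generated exactly by a_i = 1 + 2 * a_j, so the residual
# variance should vanish.
alpha, beta, sigma2 = linear_gaussian_mle([(1.0, 0.0), (3.0, 1.0),
                                           (5.0, 2.0), (7.0, 3.0)])
```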

When  we  learn  TAN  models  that  contain  discrete  and 
Gaussian  attributes,  we  restrict  ourselves  to  arcs  between 
discrete  attributes,  arcs  between  continuous  attributes,  and 
arcs  from  discrete  attributes  to  continuous  ones.  If  we  want 
also  to  model  arcs  from  continuous  to  discrete,  then  we  need 
to  introduce  additional  types  of  parametric  models,  such  as 
logistic  regression  [1].  As  we  will  show,  an  alternative 
solution is provided by the dual representation approach
introduced  in  this  paper. 

3.4  SMOOTHING 

One of the main risks in parameter estimation is overfitting.
This can happen when the parameter in question is
learned from a very small sample (e.g., predicting $A_i$ from
values of $A_j$ and of $C$ that are rare in the data). A standard
approach to this problem is to smooth the estimated parameters.
Smoothing ensures that the estimated parameters will
not be overly sensitive to minor changes in the training data.
FGG show that in the case of discrete attributes, smoothing
can lead to dramatic improvement in the performance of the
TAN classifier. They use the following smoothing rule for
the discrete case:

$$\theta^{s}_{a_i \mid \mathrm{pa}(A_i)} = \frac{N \cdot \hat{P}(\mathrm{pa}(A_i))\,\hat{P}(a_i \mid \mathrm{pa}(A_i)) + s \cdot \hat{P}(a_i)}{N \cdot \hat{P}(\mathrm{pa}(A_i)) + s},$$

where $s$ is a parameter that controls the magnitude of the
smoothing (FGG use $s = 5$ in all of their experiments).
This estimate uses a linear combination of the maximum
likelihood parameters and the unconditional frequency of
the attribute. It is easy to see that this prediction biases the
learned parameters in a manner that depends on the weight
of the smoothing parameter and the number of “relevant”
instances in the data. This smoothing operation is similar to
(and motivated by) well-known methods in statistics such
as hierarchical Bayesian and shrinkage methods [10].

We can think of this smoothing operation as pretending
that there are $s$ additional instances in which $A_i$ is distributed
according to its marginal distribution. This immediately
suggests how to smooth in the Gaussian case: we
pretend that for these additional $s$ samples $A_i$ and $A_i^2$ have the
same averages as what we encounter in the totality of the
training data. Thus, the statistics from the augmented data
are

$$\begin{aligned}
\tilde{E}[A_i \mid \mathrm{pa}(A_i)] &= \frac{N \cdot \hat{P}(\mathrm{pa}(A_i))\,E[A_i \mid \mathrm{pa}(A_i)] + s \cdot E[A_i]}{N \cdot \hat{P}(\mathrm{pa}(A_i)) + s} \\
\tilde{E}[A_i^2 \mid \mathrm{pa}(A_i)] &= \frac{N \cdot \hat{P}(\mathrm{pa}(A_i))\,E[A_i^2 \mid \mathrm{pa}(A_i)] + s \cdot E[A_i^2]}{N \cdot \hat{P}(\mathrm{pa}(A_i)) + s}
\end{aligned}$$

We then use these adjusted statistics for estimating the mean
and variance of $A_i$ given its parents. The same basic smoothing
method applies for estimating linear interactions between
continuous attributes.
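The discrete rule and the Gaussian rule share the same arithmetic: a conditional statistic is blended with its marginal counterpart as if $s$ extra instances had been observed. A minimal Python sketch of this reading follows; the function names are hypothetical, and `n_pa` stands for the count of training instances with the given parent value.

```python
def shrink(n_pa, conditional, marginal, s=5.0):
    """Blend a conditional statistic with its marginal counterpart.

    n_pa is N * P_hat(pa(A_i)), i.e. the number of instances that have
    this particular value of the parents; s is the smoothing strength.
    """
    return (n_pa * conditional + s * marginal) / (n_pa + s)


def smoothed_gaussian(n_pa, e1_cond, e2_cond, e1_all, e2_all, s=5.0):
    """Smoothed mean and variance of A_i for one value of its parents.

    e1_* are averages of A_i and e2_* are averages of A_i squared,
    conditional on the parent value (cond) or over all the data (all).
    """
    e1 = shrink(n_pa, e1_cond, e1_all, s)
    e2 = shrink(n_pa, e2_cond, e2_all, s)
    return e1, e2 - e1 ** 2


# With no relevant instances the estimate falls back to the marginal;
# with exactly s relevant instances it sits halfway between the two.
theta_rare = shrink(0, 0.9, 0.3)
theta_half = shrink(5, 1.0, 0.0)
mean, var = smoothed_gaussian(5, 2.0, 5.0, 0.0, 1.0)
```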

4  SEMIPARAMETRIC  ESTIMATION 

Parametric estimation methods assume that the data is (approximately)
distributed according to a member of the given
parametric family. If the data behaves differently enough,
then the resulting classifier will degrade in performance.
For example, suppose that for a certain class $c$, the attribute
$A_i$ has a bimodal distribution, where the two modes $x_1$ and
$x_2$ are fairly far apart. If we use a Gaussian to estimate the
distribution of $A_i$ given $C$, then the mean of the Gaussian
would be in the vicinity of $\mu = \frac{x_1 + x_2}{2}$. Thus, instances
where $A_i$ has a value near $\mu$ would receive a high probability,
given the class $c$. On the other hand, instances where $A_i$
has a value in the vicinity of either $x_1$ or $x_2$ would receive a
much lower probability given $c$. Consequently, the support
$c$ gets from $A_i$ behaves exactly the opposite of the way it
should. It is not surprising that in our experimental results,
Gaussian TAN occasionally performed much worse than the
discretized version (see Table 1).
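The bimodal example can be checked numerically. The short Python sketch below uses a made-up sample with modes near -5 and 5, fits a single Gaussian by maximum likelihood, and compares the fitted density at the midpoint with the density at one of the modes.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


# Hypothetical sample for one class: two tight clusters, nothing between.
data = [-5.1, -5.0, -4.9, 4.9, 5.0, 5.1]

# Maximum likelihood fit of a single Gaussian.
mu = sum(data) / len(data)
var = sum(x * x for x in data) / len(data) - mu ** 2

at_midpoint = gaussian_pdf(mu, mu, var)  # where no instance was observed
at_mode = gaussian_pdf(5.0, mu, var)     # where half of the data lies
```

Here the fitted density is higher at the empty midpoint than at either mode, which is exactly the inverted support described above.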

A standard way of dealing with such situations is to allow
the classifier more flexibility in the type of distributions
it learns. One approach, called semiparametric estimation,
learns a collection of parametric models. In this
approach, we model $P(A_i \mid \mathrm{Pa}(A_i))$ using a mixture of
Gaussian distributions:

$$P(A_i \mid \mathrm{pa}(A_i)) = \sum_j \varphi(A_i : \mu_{A_i \mid \mathrm{pa}(A_i), j}, \sigma^2_{A_i \mid \mathrm{pa}(A_i), j})\, w_{A_i \mid \mathrm{pa}(A_i), j},$$

where the parameters specify the mean and variance of each Gaussian in the
mixture and $w_{A_i \mid \mathrm{pa}(A_i), j}$ are the weights of the mixture components.
We require that the $w_{A_i \mid \mathrm{pa}(A_i), j}$ sum up to 1, for
each value of $\mathrm{Pa}(A_i)$.

To estimate $P(A_i \mid \mathrm{pa}(A_i))$, we need to decide on the
number of mixture components (the parameter $j$ in the equation
above) and on the best choice of parameters for that mixture.
This is usually done in two steps. First, we attempt to
fit the best parameters for different numbers of components
(e.g., $j = 1, 2, \ldots$), and then select an instantiation for $j$
based on a performance criterion.

Because there is no closed form for learning the parameters,
we need to run a search procedure such as the
Expectation-Maximization (EM) algorithm. Moreover,
since EM usually finds local maxima, we have to run it
several times, from different initial points, to ensure that we
find a good approximation to the best parameters. This operation
is more expensive than parametric fitting, since the
training data cannot be summarized for training the mixture
parameters. Thus, we need to perform many passes over
the training data to learn the parameters. Because of space
restrictions we do not review the EM procedure here, and
refer the reader to [1, pp. 65-73].
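For orientation only, the following Python sketch shows the textbook EM iteration for a one-dimensional mixture of Gaussians. It is not the implementation used in our experiments: the function names, the fixed iteration count, the variance floor `min_var`, and the toy data are all assumptions, and the smoothing of Section 3.4 is omitted.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


def em_mixture(data, init_means, iters=50, init_var=1.0, min_var=1e-3):
    """Fit mixture weights, means and variances by EM (one run)."""
    k = len(init_means)
    means = list(init_means)
    variances = [init_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            z = sum(p)
            resp.append([pj / z for pj in p])
        # M-step: re-estimate each component from its weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2
                         for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, spread)
    return weights, means, variances


# Two well-separated clusters and a deliberately rough starting point.
data = [-5.2, -5.0, -4.8, 4.8, 5.0, 5.2]
weights, means, variances = em_mixture(data, init_means=[-1.0, 1.0])
```

Each iteration touches every training instance, which is why the data cannot be summarized by a fixed set of statistics as in the purely Gaussian case. In practice the run would be repeated from several starting points, as noted above.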

With regard to selecting the number of components in the
mixture, it is easy to see that a mixture with $k + 1$ components
can easily attain the same or better likelihood as any mixture
with $k$ components. Thus, the likelihood (of the data) alone
is not a good performance criterion for selecting mixture
components, since it always favors models with a higher
number of components, which results in overfitting. Hence,
we need to apply some form of model selection. The two
main approaches to model selection are based on cross-validation
to get an estimate of true performance for each
choice of $k$, or on penalizing the performance on the training
data to account for the complexity of the learned model. For
simplicity,  we  use  the  latter  approach  with  the  BIC/MDL 
penalization. This rule penalizes the score of each mixture
with $\frac{3k}{2} \log N$, where $k$ is the number of mixture components,
and $N$ is the number of training examples for this mixture
(i.e., the number of instances in the data with this specific
value of the discrete parents).
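The selection step can be sketched in a few lines of Python. The sketch assumes that each mixture component contributes three free parameters (mean, variance, and weight), which gives a BIC/MDL penalty of (3k/2) log N; that parameter count, the function names, and the log-likelihood values below are assumptions made for illustration.

```python
from math import log


def bic_score(loglik, k, n):
    """Penalized score of a k-component mixture fitted to n instances.

    Assumes three free parameters per component, hence (3k / 2) * log(n).
    """
    return loglik - (3 * k / 2) * log(n)


def select_num_components(logliks, n):
    """Pick k with the best penalized score.

    logliks maps k to the best training log-likelihood found for a
    k-component mixture (e.g. over several EM restarts).
    """
    return max(logliks, key=lambda k: bic_score(logliks[k], k, n))


# Made-up log-likelihoods: the third component improves the fit only
# marginally, so the penalty rejects it.
best_k = select_num_components({1: -120.0, 2: -100.0, 3: -99.0}, n=100)
```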

Once more, smoothing is crucial for avoiding overfitting.
Because of space considerations we will not go into the details.
Roughly speaking, we apply the Gaussian smoothing
operation described above in each iteration of the EM procedure.
Thus, we assume that each component in the mixture
has a preassigned set of $s$ samples it has to fit.

As our experimental results show, the additional flexibility
of the mixture results in drastically improved performance
in the cases where the Gaussian TAN did poorly (see,
for example, the accuracy of the data sets “anneal-U” and
“balance-scale” in Table 1). In this paper, we learned mixtures
only when modeling a continuous feature with discrete
parents. We note, however, that learning a mixture of linear
models is a relatively straightforward extension that we are
currently implementing and testing.


5  DUAL  REPRESENTATION 

The classifiers we have presented thus far require us to make
a choice. We can either prediscretize the attributes and use
the discretized TAN, or we can learn a (semi)parametric
density model for the continuous attributes. Each of these
methods has its advantages and problems: Discretization
works well with nonstandard densities, but clearly loses
much information about the features. Semiparametric estimation
can work well for “well-behaved” multimodal densities.
On the other hand, although we can approximate any
distribution with a mixture of Gaussians, if the density is
complex, then we need a large number of training instances
to learn, with sufficient confidence, a mixture with a large
number of components.

The choice we are facing is not a simple binary one, that
is, to discretize or not to discretize all the attributes. We can
easily imagine situations in which some of the attributes
are better modeled by a semiparametric model, and others
are better modeled by a discretization. Thus, we can choose
to discretize only a subset of the attributes. Of course, the
decision about one attribute is not independent of how we
represent other attributes. This discussion suggests that we
need to select a subset of variables to discretize, that is, to
choose from an exponential space of options.

In this section, we present a new method, called hybrid
TAN, that avoids this problem by representing both the continuous
attributes and their discretized counterparts within
the same TAN model. The structure of the TAN model determines
whether the interaction between two attributes is best
represented via their discretized representation, their continuous
representation, or a hybrid of the discrete representation
of one and the continuous representation of the other.
Our hypothesis is that hybrid TAN allows us to achieve performance
that is as good as either alternative. Moreover,
the cost of learning hybrid TAN is about the same as that of
learning either alternative.

Let us assume that the first $k$ attributes, $A_1, \ldots, A_k$,
are the continuous attributes in our domain. We denote by
$A_1^*, \ldots, A_k^*$ the corresponding discretized attributes (i.e., $A_i^*$
is the discretized version of $A_i$), based on a predetermined
discretization policy (e.g., using a standard method, such
as Fayyad and Irani’s [7]). Given this semantics for the
discretized variables, we know that each $A_i^*$ is a deterministic
function of $A_i$. That is, the state of $A_i^*$ corresponds
to the interval $[x_1, x_2]$ if and only if $A_i \in [x_1, x_2]$. Thus,
even though the discretized variables are not observed in the
training data, we can easily augment the training data with
the discretized version of each continuous attribute.
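Augmenting the training data is mechanical once the cut points of the discretization are fixed. The Python sketch below is an illustration with hypothetical names and made-up cut points; it assumes half-open bins (a value equal to a cut point falls into the upper bin), which is one possible convention for resolving interval boundaries.

```python
from bisect import bisect_right


def discretize(value, cut_points):
    """Index of the bin that contains value.

    cut_points must be sorted; bin 0 lies below the first cut point and
    a value equal to a cut point is assigned to the bin above it.
    """
    return bisect_right(cut_points, value)


def augment(row, cuts):
    """Append the discretized version of every continuous attribute.

    row is one training instance; cuts maps the column index of each
    continuous attribute to its sorted list of cut points.
    """
    return list(row) + [discretize(row[i], cuts[i]) for i in sorted(cuts)]


# Hypothetical instance with two continuous columns and one discrete one.
cuts = {0: [2.5, 7.0], 1: [5.0]}
augmented = augment([1.0, 8.0, "x"], cuts)
```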

At this stage one may consider the application of one of
the methods we described above to the augmented training
set. This, however, runs the risk of “double counting” the
evidence for classification provided by the duplicated attributes.
The likelihood of the learned model will contain
a penalty for both the continuous and the discrete versions
of the attribute. Consequently, during classification, a “surprising”
value of an attribute would have twice the (negative)
effect on the probability of the class variable. One
could avoid this problem by evaluating only the likelihood
assigned to the continuous version of the attributes. Unfortunately,
in this case the basic decomposition of Eq. (2) no
longer holds, and we cannot use the TAN procedure.


5.1  MODELING  THE  DUAL  REPRESENTATION 

Our approach takes advantage of Bayesian networks to
model the interaction between an attribute and its discretized
version. We constrain the networks we learn to match our
model of the discretization, that is, a discretized attribute is
a function of the continuous one. More specifically, for each
continuous attribute $A_i$, we require that $P_B(A_i^* \mid A_i) = 1$
iff $A_i$ is in the range specified by $A_i^*$. It is easy to
show (using the chain rule) that this constraint implies that
$P_B(A_1, \ldots, A_n, A_1^*, \ldots, A_k^*) = P_B(A_1, \ldots, A_n)$, avoiding
the problem outlined in the previous paragraph.
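The chain-rule argument can be spelled out as follows. This is a sketch of one way to see the claim, stated for values of the discretized attributes that are consistent with the continuous ones (for inconsistent values the joint probability is 0):

```latex
\begin{align*}
P_B(A_1,\dots,A_n,A_1^*,\dots,A_k^*)
  &= P_B(A_1,\dots,A_n)\,
     \prod_{i=1}^{k} P_B\bigl(A_i^* \mid A_1,\dots,A_n,A_1^*,\dots,A_{i-1}^*\bigr) \\
  &= P_B(A_1,\dots,A_n)\cdot 1 .
\end{align*}
```

Each factor in the product equals 1 because $P_B(A_i^* \mid A_i) = 1$ already holds for the consistent value of $A_i^*$, and conditioning on further variables (on an event of positive probability) cannot lower a conditional probability that is equal to 1.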

Note that by imposing this constraint we are not requiring
in any way that $A_i$ be a parent of $A_i^*$. However, we do need
to ensure that $P(A_i^* \mid A_i)$ is deterministic in the learned
model. We do so by requiring that $A_i$ and $A_i^*$ are adjacent
in the graph (i.e., one is the parent of the other) and by
putting restrictions on the models we learn for $P(A_i \mid A_i^*)$
and $P(A_i^* \mid A_i)$. There are two possibilities:

If $A_i \to A_i^*$ is in the graph, then the conditional distribution
$P(A_i^* \mid A_i, C)$ is determined as outlined above; it
is 1 if $A_i$ is in the range defined by the value of $A_i^*$ and 0
otherwise.

If $A_i^* \to A_i$ is in the graph, then we require that
$P(A_i \mid A_i^*, C) = 0$ whenever $A_i$ is not in the range specified
by $A_i^*$. By Bayes rule, $P(A_i^* \mid A_i, C) \propto P(A_i \mid A_i^*, C)\,P(A_i^*, C)$.
Thus, if $A_i$ is not in the range of $A_i^*$,
then $P(A_i^* \mid A_i, C) \propto 0 \cdot P(A_i^*, C) = 0$. Since the
conditional probability of $A_i^*$ given $A_i$ must sum to 1, we
conclude that $P(A_i^* \mid A_i) = 1$ iff $A_i$ is in the range of $A_i^*$.

There is still the question of the form of $P(A_i \mid A_i^*, C)$.
Our proposal is to learn a model for $A_i$ given $A_i^*$ and $C$, using
the standard methods above (i.e., a Gaussian or a mixture
of Gaussians). We then truncate the resulting density on
the boundaries of the region specified by the discretization,
and we ensure that the truncated density has total mass 1 by
applying a normalizing constant. In other words, we learn
an unrestricted model, and then condition on the fact that
$A_i$ can only take values in the specified interval.
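For a single Gaussian, the truncation and renormalization can be written down directly. The following Python sketch is an illustration with hypothetical function names; it covers the single-Gaussian case only and obtains the Gaussian cumulative distribution from the error function in the standard library.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, var):
    """Density of a Gaussian with mean mu and variance var at x."""
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)


def normal_cdf(x, mu, var):
    """Cumulative distribution of the same Gaussian at x."""
    return 0.5 * (1.0 + erf((x - mu) / sqrt(2.0 * var)))


def truncated_pdf(x, mu, var, lo, hi):
    """Gaussian density conditioned on the interval [lo, hi]."""
    if not (lo <= x <= hi):
        return 0.0  # outside the bin specified by the discretized value
    mass = normal_cdf(hi, mu, var) - normal_cdf(lo, mu, var)
    return normal_pdf(x, mu, var) / mass  # renormalize to total mass 1


# Standard Gaussian truncated to the bin [-1, 1].
inside = truncated_pdf(0.0, 0.0, 1.0, -1.0, 1.0)
outside = truncated_pdf(3.0, 0.0, 1.0, -1.0, 1.0)
```

Inside the bin the truncated density is larger than the unrestricted one, since the mass that fell outside the interval has been redistributed by the normalizing constant.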

Our goal is then to learn a TAN model that includes both
the continuous and discretized versions of each continuous
attribute, and that satisfies the restrictions we just described.
Since these restrictions are not enforced by the procedure of
Section 3.1, we need to augment it. We start by observing
that our restrictions imply that if we include $B \to A$ in the
model, we must also include $A \to A^*$. To see this, note that
since $A$ already has one parent ($B$) it cannot have additional
parents. Thus, the only way of making $A$ and $A^*$ adjacent
is by adding the edge $A \to A^*$. Similarly, if we include the
edge $B \to A^*$, we must also include $A^* \to A$.

This observation suggests that we consider edges between
groups of variables, where each group contains both versions
of an attribute. In building a TAN structure that includes
both representations, we must take into account that adding
an edge to an attribute in a group immediately constrains
the addition of other edges within the group. Thus, the TAN
procedure should make choices at the level of groups. Such a
procedure, which we call Hybrid-TAN, is described next.

Figure 2: The three possible ways of placing an edge from $\{B, B^*\}$
into $\{A, A^*\}$. The parameterization of the possible arcs is as follows:
$B^* \to A^*$ is a discrete model, both $B^* \to A$ and $B \to A$
are continuous models (e.g., Gaussians), $A^* \to A$ is a truncated
continuous model (e.g., truncated Gaussian), and $A \to A^*$ is a
deterministic model.

5.2  HYBRID-TAN 

We now expand on the details of the procedure. As with
the basic procedure, we compute scores on edges. Now,
however, edges are between groups of attributes, each group
consisting of the different representations of an attribute.

Let $A$ be a continuous attribute. By our restriction, either
$A \in \mathrm{Pa}(A^*)$, or $A^* \in \mathrm{Pa}(A)$. And since each attribute has
at most one parent (in addition to the class $C$), we have that at
most one other attribute is in $\mathrm{Pa}(A) \cup \mathrm{Pa}(A^*) - \{A, A^*, C\}$.
We define a new function $T(A \mid B)$ that denotes the best
combination of parents for $A$ and $A^*$ such that either $B$ or
$B^*$ is a parent of one of these attributes. Similarly, $T(A \mid \emptyset)$
denotes the best configuration such that no other attribute is
a parent of $A$ or $A^*$.

First, consider the term $T(A \mid \emptyset)$. If we decide that
neither $A$ nor $A^*$ has other parents, then we can freely
choose between $A \to A^*$ and $A^* \to A$. Thus

$$T(A \mid \emptyset) = \max\bigl( S(A \mid C, A^*) + S(A^* \mid C),\; S(A \mid C) + S(A^* \mid C, A) \bigr),$$

where $S(A \mid C, A^*)$ and $S(A^* \mid C, A)$ are the scores of
the CPDs subject to the constraints discussed in Subsection 5.1
(the first is a truncated model, and the second is a
deterministic model).

Next, consider the case that a continuous attribute $B$ is
a parent of $A$. There are three possible ways of placing an
edge from the group $\{B, B^*\}$ into the group $\{A, A^*\}$. These
cases are shown in Figure 2. (The fourth case is disallowed,
since we cannot have an edge from the continuous attribute
$B$ to the discrete attribute $A^*$.) It is easy to verify that in
any existing TAN network, we can switch between the edge
configurations of Figure 2 without introducing new cycles.
Thus, given the decision that the group $B$ points to the group
$A$, we would choose the configuration with maximal score:

$$T(A \mid B) = \max\bigl( S(A \mid C, B^*) + S(A^* \mid C, A),\; S(A \mid C, A^*) + S(A^* \mid C, B^*),\; S(A \mid C, B) + S(A^* \mid C, A) \bigr).$$

Finally, when $B$ is discrete, then $T(A \mid B)$ is the maximum
between two options ($B$ as a parent of $A$ or as a parent of
$A^*$), and when $A$ is discrete, then $T(A \mid B)$ is equal to one
term (either $S(A \mid C, B)$ or $S(A \mid C, B^*)$, depending on
$B$’s type).
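The group-level scores for two continuous attributes amount to taking a maximum over a small table of family scores. The Python sketch below illustrates this; the dictionary layout, the function names, and all the numeric scores are hypothetical. It relies on one observation that follows from Subsection 5.1: the deterministic CPD assigns probability 1 to every training instance, so its log-likelihood score is 0.

```python
def t_no_parent(S):
    """T(A | empty): best orientation of the arc between A and A*.

    S maps (child, extra_parent) to the score of that family; the class
    C is an implicit parent everywhere and extra_parent may be None.
    Returns (score, description of the chosen configuration).
    """
    return max(
        (S[("A", "A*")] + S[("A*", None)], "A* -> A"),
        (S[("A", None)] + S[("A*", "A")], "A -> A*"),
    )


def t_continuous_parent(S):
    """T(A | B) when both A and B are continuous (the cases of Figure 2)."""
    return max(
        (S[("A", "B*")] + S[("A*", "A")], "B* -> A, A -> A*"),
        (S[("A", "A*")] + S[("A*", "B*")], "B* -> A*, A* -> A"),
        (S[("A", "B")] + S[("A*", "A")], "B -> A, A -> A*"),
    )


# Made-up log-likelihood scores for one pair of attribute groups.
S = {
    ("A", None): -50.0, ("A*", None): -20.0,
    ("A", "A*"): -35.0, ("A*", "A"): 0.0,   # deterministic CPD: log 1 = 0
    ("A", "B"): -40.0, ("A", "B*"): -45.0, ("A*", "B*"): -12.0,
}
base, base_config = t_no_parent(S)
best, best_config = t_continuous_parent(S)
arc_weight = best - base  # weight of the group-level arc from B to A
```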

Figure 3: A hybrid TAN model learned for the data set “glass2.” For clarity, the edges from the class to all the attributes are not shown.
The attributes marked with asterisks (*) correspond to the discretized representation. Dotted boxes mark two versions of the same
attribute.

Figure 4: Differences in the modeling of the interaction between attributes, for mixtures of Gaussians and the hybrid model. The graphs
show the interaction between Calcium (C) and Magnesium (M) in the “glass2” data set, given a specific value of the class.

We now define the Hybrid-TAN procedure:

1. Initialize an empty graph $G$ with $n$ vertices labeled
$1, \ldots, n$.

2. For each attribute $A_i$, compute the scores of the form
$S(A_i \mid C)$, $S(A_i^* \mid C)$, $S(A_i \mid C, A_i^*)$, etc. For each
$A_j$ with $j \neq i$, add to $G$ an arc $j \to i$ with weight
$T(A_i \mid A_j) - T(A_i \mid \emptyset)$.

3. Find a maximal weighted branching $\mathcal{A}$ in $G$.

4. Construct the TAN model that contains edges from $C$ to
each $A_i$ and $A_i^*$. If $j \to i$ is in $\mathcal{A}$, add the best configuration
of edges (and the corresponding CPDs) from the
group $A_j$ into $A_i$. If $i$ does not have an incoming arc in
$\mathcal{A}$, then add the edge between $A_i$ and $A_i^*$ that maximizes
$T(A_i \mid \emptyset)$.

It is straightforward to verify that this procedure performs
the required optimization:

Theorem 5.1: The procedure Hybrid-TAN constructs in
polynomial time a dual TAN network $B$ that maximizes
$LL(B : D)$, given the constraints on the CPDs and the
constraint that $A_i$ and $A_i^*$ are adjacent in the graph.

5.3  AN  EXAMPLE 

Figure 3 shows an example of a hybrid TAN model learned
from one of the folds of the “glass2” data set.¹ It is instructive
to compare it to the network in Figure 1, which was
learned by a TAN classifier based on mixtures of Gaussians
from the same data set. As we can see, there are some
similarities between the networks, such as the connections
between “Silicon” and “Sodium,” and between “Calcium”
and “Magnesium” (which was reversed in the hybrid version).
However, most of the network’s structure is quite
different. Indeed, the relation between “Magnesium” and
“Calcium” is now modulated by the discretized version of
these variables. This fact, and the increased accuracy of hybrid
TAN for this data set (see Table 1), seem to indicate that
in this domain attributes are not modeled well by Gaussians.

¹ Some of the discrete attributes do not appear in the figure,
since they were discretized into one bin.

As  a  further  illustration  of  this,  we  show  in  Figure  4  the 
estimate  of  the  joint  density  of  “Calcium”  and  “Magnesium” 
in  both  networks  (given  a  particular  value  for  the  class),  as 
well  as  the  training  data  from  which  both  estimates  were 
learned.  As  we  can  see,  most  of  the  training  data  is  centered 
at  one  point  (roughly,  when  M  =  3.5  and  C  =  8),  but 
there  is  fair  dispersion  of  data  points  when  M  =  0.  In  the 
Gaussian  case,  C  is  modeled  by  a  mixture  of  two  Gaussians 
(centered  on  8.3  and  11.8,  where  the  former  has  most  of 
the  weight  in  the  mixture),  and  M  is  modeled  as  a  linear 
function  of  C  with  a  fixed  variance.  Thus,  we  get  a  sharp 
“bump”  at  the  main  concentration  point  on  the  low  ridge  in 
Figure  4a.  On  the  other  hand,  in  the  hybrid  model,  for  each 
attribute,  we  model  the  probability  in  each  bin  by  a  truncated 
Gaussian.  In  this  case,  C  is  partitioned  into  three  bins  and 
M  into  two.  This  model  results  in  the  discontinuous  density 
function  we  see  in  Figure  4b.  As  we  can  see,  the  bump 
at  the  center  of  concentration  is  now  much  wider,  and  the 
whole  region  of  dispersion  corresponds  to  a  low,  but  wide, 
“tile”  (in  fact,  this  tile  is  a  truncated  Gaussian  with  a  large 
variance). 

6  EXPERIMENTAL  EVALUATION 

We ran our experiments on the 23 data sets listed in Table 1.
All of these data sets are from the UCI repository [16], and
are accessible at the MLC++ ftp site. The accuracy of
each classifier is based on the percentage of successful predictions
on the test sets of each data set. We estimate the
prediction accuracy of each classifier as well as the variance
of this accuracy by using the MLC++ system [14]. Accuracy
was evaluated using 5-fold cross validation (using
the methods described in [13]). Since we do not currently
deal with missing data, we removed instances with missing
values from the data sets. To construct discretizations, we
used a variant of the method of Fayyad and Irani [7], using
only the training data, in the manner described in [4]. These
preprocessing stages were carried out by the MLC++ system.
We note that experiments with the various learning
procedures were carried out on exactly the same training
sets and evaluated on exactly the same test sets.

Figure 5: Scatter plots comparing the performance (a) of Disc (x axis) vs. Mix (y axis), (b) of H/Mix (x axis) vs. Disc and Mix (y axis),
and (c) of H/Mix (x axis) vs. H/Mix-FS (y axis). In these plots, each point represents a data set, and the coordinates correspond to the
prediction error of each of the methods compared. Points below the diagonal line correspond to data sets where the y axis method is
more accurate, and points above the diagonal line correspond to data sets where the x axis method is more accurate. In (b), the dashed
lines connect points that correspond to the same data set.

Table 1 summarizes the accuracies of the learning procedures
we have discussed in this paper: (1) Disc: TAN classifier
based on prediscretized attributes; (2) Gauss: TAN
classifier using Gaussians for the continuous attributes and
multinomials for the discrete ones; (3) Mix: TAN classifier
using mixtures of Gaussians for the continuous attributes;
(4) H/Gauss: hybrid TAN classifier enabling the dual representation
and using Gaussians for the continuous version of
the attributes; (5) H/Mix: hybrid TAN classifier using mixtures
of Gaussians for the continuous version of the attributes;
and (6) H/Mix-FS: same as H/Mix but incorporating a primitive
form of feature selection. The discretization procedure
often removes attributes by discretizing them into one interval.
Thus, these attributes are ignored by the discrete version
of TAN. H/Mix-FS imitates this feature selection by also ignoring
the continuous version of the attributes removed by
the discretization procedure.

As we can see in Figure 5(a), neither the discrete TAN
(Disc) nor the mixture of Gaussians TAN (Mix) outperforms
the other. In some domains, such as “anneal-U” and
“glass,” the discretized version clearly performs better; in
others, such as “balance-scale,” “hayes-roth,” and “iris,” the
semiparametric version performs better. Note that the latter
three data sets are all quite small. So, a reasonable hypothesis
is that the data is too sparse to learn good discretizations.


On  the  other  hand,  as  we  can  see  in  Figure  5(b),  the  hybrid 
method  performs  at  roughly  the  same  level  as  the  best  of 
either  Mix  or  Disc  approaches.  In  this  plot,  each  pair  of 
connected  points  describes  the  accuracy  results  achieved  by 
Disc  and  Mix  for  a  single  data  set.  Thus,  the  best  accuracy  of 
these  two  methods  is  represented  by  the  lower  point  on  each 
line. As we can see, in most data sets the hybrid method
performs  roughly  at  the  same  level  as  these  lower  points.  In 
addition,  in  some  domains  such  as  “glass2,”  “hayes-roth,” 
and  “hepatitis”  the  ability  to  model  more  complex  interac¬ 
tions  between  the  different  continuous  and  discrete  attributes 
results  in  a  higher  prediction  accuracy.  Finally,  given  the 
computational  cost  involved  in  using  EM  to  fit  the  mixture 
of  Gaussians  we  include  the  accuracy  of  H/Gauss  so  that 
the  benefits  of  using  a  mixture  model  can  be  evaluated.  At 
the  same  time,  the  increase  in  prediction  accuracy  due  to 
the  dual  representation  can  be  evaluated  by  comparing  to 
Gauss. 

Because H/Mix increases the number of parameters
that need to be fitted, feature selection techniques
are bound to have a noticeable impact. This is evident
in the results obtained for H/Mix-FS which, as mentioned
above, supports a primitive form of feature selection (see
Figure 5(c)). These results indicate that we may achieve better
performance by incorporating a feature selection mechanism
into the classifier. We leave this as a topic for future
research.

7  CONCLUSIONS 

The  contributions  of  this  work  are  twofold.  First,  we  extend 
the  TAN  classifier  to  directly  model  continuous  attributes  by 
parametric  and  semiparametric  methods.  We  use  standard 
procedures  to  estimate  each  of  the  conditional  distributions, 
and  then  combine  them  in  a  structure  learning  phase  by 
maximizing  the  likelihood  of  the  TAN  model.  The  resulting 
procedure  preserves  the  attractive  properties  of  the  original 
TAN  classifier — we  can  learn  the  best  model  in  polynomial 
time.

Table 1: Experimental Results. The first four columns give the name of the data set, the number of continuous (C) and discrete (D) attributes,
and the number of instances. The remaining columns report percentage classification error and standard deviation from 5-fold cross validation
of the tested procedures (see text).

                              Prediction errors (%)
Data set         C   D  Size  Disc           Gauss          Mix            H/Gauss        H/Mix          H/Mix-FS
anneal-U         6  32   898   2.45 ±  1.01  23.06 ±  3.49   7.46 ±  3.12  10.91 ±  1.79   4.12 ±  1.78   4.34 ±  1.43
australian       6   8   690  15.36 ±  2.37  23.77 ±  3.26  18.70 ±  4.57  17.10 ±  2.83  16.23 ±  2.38  15.80 ±  1.94
auto            15  10   159  23.93 ±  8.57  28.41 ± 10.44  29.03 ± 10.04  27.10 ±  8.12  26.47 ±  8.44  21.41 ±  4.27
balance-scale    4   0   625  25.76 ±  7.56  11.68 ±  3.56   9.60 ±  2.47  11.84 ±  3.89  13.92 ±  2.16  13.92 ±  2.16
breast          10   0   683   3.22 ±  1.69   5.13 ±  1.73   3.66 ±  2.13   3.22 ±  1.69   4.34 ±  1.10   4.32 ±  0.96
cars1            7   0   392  26.52 ±  2.64  25.03 ±  7.11  26.30 ±  4.44  25.28 ±  6.54  24.27 ±  7.85  25.79 ±  6.21
cleve            6   7   296  18.92 ±  1.34  17.23 ±  1.80  16.24 ±  3.97  16.24 ±  3.97  15.89 ±  3.14  16.23 ±  3.58
crx              6   9   653  15.01 ±  1.90  24.05 ±  4.44  19.76 ±  4.04  17.31 ±  1.60  15.47 ±  1.87  15.47 ±  2.09
diabetes         8   0   768  24.35 ±  2.56  25.66 ±  2.70  24.74 ±  3.74  22.65 ±  3.21  24.86 ±  4.06  24.60 ±  3.45
echocardiogram   6   1   107  31.82 ± 10.34  28.23 ± 13.86  30.13 ± 14.94  29.18 ± 14.05  29.18 ± 14.05  30.95 ± 11.25
flare            2   8  1066  17.63 ±  4.19  17.91 ±  4.34  17.63 ±  4.46  17.91 ±  4.34  17.63 ±  4.46  17.63 ±  4.19
german-org      12  12  1000  26.30 ±  2.59  25.30 ±  2.97  25.60 ±  1.39  25.70 ±  3.47  25.20 ±  1.75  26.60 ±  2.27
german           7  13  1000  26.20 ±  4.13  25.20 ±  2.51  24.60 ±  1.88  25.10 ±  2.07  25.30 ±  3.33  25.70 ±  4.40
glass            9   0   214  30.35 ±  5.58  49.06 ±  6.29  48.13 ±  8.12  32.23 ±  4.63  31.30 ±  5.00  33.16 ±  5.65
glass2           9   0   163  21.48 ±  3.73  38.09 ±  7.92  38.09 ±  7.92  34.39 ±  9.62  31.27 ±  9.63  23.30 ±  6.22
hayes-roth       4   0   160  43.75 ±  4.42  33.12 ± 11.40  31.88 ±  6.01  29.38 ± 10.73  18.75 ±  5.85  14.38 ±  4.19
heart           13   0   270  16.67 ±  5.56  15.56 ±  5.65  15.19 ±  5.46  15.19 ±  3.56  17.41 ±  4.65  15.93 ±  5.34
hepatitis        6  13    80   8.75 ±  3.42  12.50 ±  4.42  10.00 ±  3.42  12.50 ±  7.65  10.00 ±  5.59  11.25 ±  5.23
ionosphere      34   0   351   7.70 ±  2.62   9.13 ±  3.31   9.41 ±  2.98   6.85 ±  3.27   6.85 ±  3.27   7.13 ±  3.65
iris             4   0   150   6.00 ±  2.79   2.00 ±  2.98   2.00 ±  2.98   4.67 ±  1.83   4.67 ±  1.83   4.67 ±  1.83
liver-disorder   6   0   345  41.16 ±  1.94  40.29 ±  5.16  33.33 ±  4.10  36.52 ±  7.63  30.43 ±  5.12  41.74 ±  2.59
pima             8   0   768  24.87 ±  2.82  24.35 ±  1.45  24.35 ±  3.47  22.92 ±  3.96  25.52 ±  2.85  24.48 ±  2.87
post-operative   1   7    87  29.74 ± 13.06  34.38 ± 10.09  30.98 ± 11.64  34.38 ± 10.09  30.98 ± 11.64  29.74 ± 13.06

Of course, one might extend TAN to use other parametric
families (e.g., Poisson distributions) or other semiparametric
methods (e.g., kernel-based methods). The general
conclusion we draw from these extensions is that if the
assumptions  embedded  in  the  parametric  forms  “match”  the 
domain,  then  the  resulting  TAN  classifier  generalizes  well 
and  will  lead  to  good  prediction  accuracy.  We  also  note 
that  it  is  straightforward  to  extend  the  procedure  to  select, 
at  learning  time,  a  parametric  form  from  a  set  of  parametric 
families. 

Second,  we  introduced  a  new  method  to  deal  with  differ¬ 
ent  representations  of  continuous  attributes  within  a  single 
model.  This  method  enables  our  model  learning  procedure 
(in  this  case,  TAN)  to  automate  the  decision  as  to  which  rep¬ 
resentation  is  most  useful  in  terms  of  providing  information 
about  other  attributes.  As  we  showed  in  our  experiments, 
the  learning  procedure  managed  to  make  good  decisions  on 
these issues and achieve performance that is roughly as good
as  both  the  purely  discretized  and  the  purely  continuous 
approaches. 

This method can be extended in several directions. One
example is to deal with several discretizations of the same
attribute in order to select the granularity of discretization
that is most useful for predicting other attributes. Another
direction involves adapting the discretization to the particular
edges that are present in the model. As argued by Friedman
and Goldszmidt [9], it is possible to discretize attributes to
gain the most information about the neighboring attributes.
Thus,  we  might  follow  the  approach  in  [9]  and  iteratively 
readjust  the  structure  and  discretization  to  improve  the  score. 
Finally,  it  is  clear  that  this  hybrid  method  is  applicable  not 
only  to  classification,  but  also  to  density  estimation  and 
related tasks using general Bayesian networks. We are currently
pursuing these directions.

Abstract

Support  Vector  Machines  work  by  mapping 
training  data  for  classification  tasks  into  a 
high  dimensional  feature  space.  In  the  fea¬ 
ture  space  they  then  find  a  maximal  margin 
hyperplane  which  separates  the  data.  This 
hyperplane  is  usually  found  using  a  quadratic 
programming routine which is computationally
intensive and nontrivial to implement.
In this paper we propose an adaptation
of the Adatron algorithm for classification
with kernels in high dimensional
spaces.  The  algorithm  is  simple  and  can 
find  a  solution  very  rapidly  with  an  exponen¬ 
tially  fast  rate  of  convergence  (in  the  number 
of  iterations)  towards  the  optimal  solution. 
Experimental  results  with  real  and  artificial 
datasets  are  provided. 

Keywords:  Support  Vector  Machine,  Large  Margin  Clas¬ 
sifier,  Adatron,  Statistical  Mechanics 

1  INTRODUCTION 

Support  Vector  (SV)  machines  are  an  algorithm  in¬ 
troduced  by  Vapnik  and  co-workers  [5,  4]  theoretically 
motivated  by  VC  theory.  They  are  based  on  the  fol¬ 
lowing  idea:  input  points  are  mapped  to  a  high  dimen¬ 
sional  feature  space,  where  a  separating  hyperplane 
can  be  found.  The  algorithm  is  chosen  in  such  a  way 
to  maximize  the  distance  from  the  closest  patterns,  a 
quantity  which  is  called  the  margin. 

This  is  achieved  by  reducing  the  problem  to  a 
quadratic programming problem, which is then usually
solved with optimization routines from numerical
libraries. This step is computationally intensive, can be
subject to stability problems, and is nontrivial to
implement.

SV  machines  have  a  proven  impressive  performance  on 
a  number  of  real  world  problems  such  as  optical  char¬ 
acter  recognition  and  face  detection  [5,  6, 19, 17].  How¬ 
ever,  their  uptake  has  been  limited  in  practice  because 
of  the  mentioned  problems  with  the  current  training 
algorithms. 

An  analogous  problem  has  been  studied  in  the  Statis¬ 
tical  Mechanics  literature,  which  has  produced  a  num¬ 
ber  of  perceptron  learning  procedures  aimed  at  find¬ 
ing  maximal  margin  hyperplanes  in  the  input  space 
[11, 15, 13]. For some of them theoretical guarantees
are also provided, as in the case of Adatron [2], where
not only convergence toward the optimal solution
has been proved, but also an exponential rate of convergence
in the number of iterations.

We propose a “hybrid” algorithm, the Kernel-Adatron
(KA), which combines the implementational simplicity
of  Adatron  with  the  capability  of  working  in  nonlin¬ 
ear  feature  spaces  as  SV  machines  do.  By  introducing 
Kernels  into  the  algorithm  it  is  possible  to  maximize 
the  margin  in  the  feature  space,  which  is  equivalent 
to  nonlinear  decision  boundaries  in  the  input  space. 
The  algorithm  comes  with  all  the  theoretical  guaran¬ 
tees  given  by  VC  theory  for  large  margin  classifiers, 
as  well  as  the  convergence  properties  studied  in  the 
Statistical  Mechanics  literature. 

The  result  is  a  fast,  robust  and  extremely  simple  proce¬ 
dure  which  implements  the  same  ideas  and  principles 
as  SV  machines  at  much  smaller  cost.  Experimental 
results are provided which show that indeed the predictive
power of our algorithm is equivalent to that of
an SV machine. Furthermore, we show that the running
time can be orders of magnitude faster.

2 SUPPORT VECTOR MACHINES

Support  Vector  (SV)  machines  implement  complex 
(nonlinear)  decision  rules  in  terms  of  hyperplanes  in 
high  dimensional  spaces  and  were  originally  introduced 
by  Vapnik  and  co-workers  [24,  5,  10]. 

The decision function realized by SV machines can
conceptually be described in two steps: first the training
points are mapped by a nonlinear function $\phi$ to
a high-dimensional space where they are linearly separable.
Then a separating hyperplane is found which
maximizes its distance from the training set, called the
margin.

Theoretical  results  exist  from  VC  theory  [24,  21], 
which guarantee that such a solution has high predictive
power, in the sense that it minimizes an upper
bound  on  the  test  error  (a  complete  survey  covering 
the  generalization  power  of  SV  machines  can  be  found 
in  [3]). 

Let $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$ be a sample of
points $x_i \in X$ labelled by $y_i \in \{-1, +1\}$.

Consider a hyperplane defined by $(w, \theta)$, where $w$ is a
weight vector and $\theta$ a threshold value. Let $S = (X, Y)$
be a labeled sample of inputs from $X$ that has empty
intersection with the hyperplane, so that

$$\gamma = \min_{x \in X} |\langle x, w \rangle + \theta| > 0$$

We call this distance the margin of the hyperplane $w$
with respect to the sample $S$.

We also say that the hyperplane is in canonical form
with respect to the sample if

$$\min_{x \in X} |\langle x, w \rangle + \theta| = 1$$

It is possible to prove that for canonical hyperplanes
$\gamma = 1/\|w\|_2$.
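
As a small illustration of these definitions, the following sketch computes $\min_x |\langle x, w \rangle + \theta|$ on a toy sample and rescales the hyperplane to canonical form. The function names and the four sample points are assumptions made for the example only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def functional_margin(w, theta, xs):
    """min over the sample of |<x, w> + theta|."""
    return min(abs(dot(x, w) + theta) for x in xs)

def to_canonical(w, theta, xs):
    """Rescale (w, theta) so that min |<x, w> + theta| equals 1."""
    g = functional_margin(w, theta, xs)
    if g == 0:
        raise ValueError("a sample point lies on the hyperplane")
    return [wi / g for wi in w], theta / g

# Toy sample: the closest points to the hyperplane x_1 = 0 are at distance 2.
xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, -1.0)]
w, theta = [2.0, 0.0], 0.0

w_c, theta_c = to_canonical(w, theta, xs)
geometric_margin = 1.0 / math.sqrt(dot(w_c, w_c))   # 1 / ||w|| for the canonical form
```

For this sample the canonical weight vector is (0.5, 0), so $1/\|w\|_2 = 2$, which matches the distance of the closest points to the hyperplane.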

The  following  theorem  holds: 

Theorem: [21] Suppose inputs are drawn independently
according to a distribution whose support is
contained in a ball in $\mathbb{R}^n$ centered at the origin, of radius
$R$. If we succeed in correctly classifying $m$ such
inputs by a canonical hyperplane with $\|w\| = 1/\gamma$ and
$|\theta| \le R$, then with confidence $1 - \delta$ the generalization
error will be bounded from above by

$$\epsilon(m, \gamma) = \frac{2}{m} \left( k \log\left(\frac{8em}{k}\right) \log(32m) + \log\left(\frac{8m}{\delta}\right) \right)$$

where $k = \lfloor 577 R^2 / \gamma^2 \rfloor$.

The  quantity  which  upper  bounds  the  generalization 
error  does  not  depend  on  the  dimension  of  the  input 
space,  and  this  is  the  theoretical  reason  why  SV  ma¬ 
chines  can  use  high  dimensional  spaces  without  over¬ 
fitting. 

Two  main  ideas  (data-dependent  representation  and 
kernels)  make  it  possible  to  efficiently  deal  with  very 
high  dimensional  feature  spaces. 

The first is based on the identity:

$$\sum_{i=1}^{N} w_i \phi_i(x) + \theta = \sum_{k=1}^{p} \alpha_k \, \phi(x_k) \cdot \phi(x) + \theta$$

which provides an alternative, data-dependent representation
of the hypothesis itself, and the other is the
use of kernels:

$$K(x', x) = \sum_{i} \phi_i(x') \phi_i(x)$$

which are equivalent to computing the dot product
of the images of two vectors in the feature space [1],
provided some (nontrivial) conditions are satisfied.

Common choices are Radial Basis Functions (RBF),
such as Gaussians, or polynomial kernels

$$K(x, x') = (\langle x, x' \rangle + 1)^d$$

which satisfy such conditions.

The  use  of  the  kernels  instead  of  the  dot  product, 
in  the  data-dependent  representation  of  the  decision 
function,  automatically  provides  a  way  to  represent 
hyperplanes  in  a  feature  space  rather  than  in  the  in¬ 
put  space,  as  first  described  in  [1]. 
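
To make the kernel idea concrete, here is a minimal sketch of the two kernel families mentioned above. The Gaussian form $\exp(-\|x - x'\|^2 / (2\sigma^2))$ is the usual convention and is assumed here, since the exact expression is not legible in the text; the explicit feature map `poly2_features` and all function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(x, z, sigma=1.0):
    # Assumed form: exp(-||x - z||^2 / (2 sigma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, d=2):
    return (dot(x, z) + 1.0) ** d

def poly2_features(x):
    """Explicit feature map whose dot product equals the d = 2 polynomial kernel in 2-D."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

def gram_matrix(xs, kernel):
    return [[kernel(a, b) for b in xs] for a in xs]

x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = polynomial_kernel(x, z)                      # (3 - 2 + 1)^2
k_explicit = dot(poly2_features(x), poly2_features(z))    # same value via the feature map
G = gram_matrix([x, z, (0.0, 0.0)], gaussian_kernel)
```

The kernel evaluation never builds the six-dimensional feature vectors, yet it returns the same number as their dot product; this is what allows hyperplanes to be represented in the feature space at the cost of working in the input space.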

The  second  conceptual  step,  aimed  at  finding  the  large 
margin  hyperplane,  is  performed  in  SV  machines  by 
transforming the problem into a Quadratic Programming
one, subject to linear constraints.

Kuhn-Tucker theory [24] provides the framework under
which the problem can be solved and gives the
properties  of  the  solution. 

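In the data-dependent representation, the Lagrangian

$$L = \sum_{i=1}^{p} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{p} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

has to be maximized with respect to the $\alpha_i$, subject to
the constraints

$$\alpha_i \ge 0, \qquad \sum_{i=1}^{p} \alpha_i y_i = 0$$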
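
A tiny worked example may help here. The sketch below evaluates the dual objective on a hypothetical two-point sample with a linear kernel, where the equality constraint forces both multipliers to be equal and the maximum can be checked by hand. The sample, the grid of candidate values, and the function name are assumptions for illustration.

```python
def dual_objective(alpha, ys, K):
    """L = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij."""
    p = len(alpha)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * K[i][j]
               for i in range(p) for j in range(p))
    return sum(alpha) - 0.5 * quad

# Toy sample: x_1 = +1 with y_1 = +1, x_2 = -1 with y_2 = -1, linear kernel K_ij = x_i x_j.
xs = [1.0, -1.0]
ys = [1, -1]
K = [[a * b for b in xs] for a in xs]

# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, so L(a) = 2a - 2a^2.
candidates = [i / 100.0 for i in range(0, 101)]
best_a = max(candidates, key=lambda a: dual_objective([a, a], ys, K))

# Weight vector recovered from the multipliers: w = sum_i alpha_i y_i x_i.
w = sum(a * y * x for a, y, x in zip([best_a, best_a], ys, xs))
```

The maximum sits at $a = 1/2$, which gives $w = 1$ and hence a margin of $1/|w| = 1$, the distance of each point from the origin.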

This  formulation  has  a  number  of  interesting  proper¬ 
ties,  characterizing  the  behaviour  of  the  optimal  hy¬ 
perplane. 

There is a Lagrange multiplier $\alpha_i$ for each training
point. Only the points which lie closest to the hyperplane
(on parallel hyperplanes at distance $\gamma$ from
the optimal one) have $\alpha_i > 0$ and are called support
vectors. All the others have $\alpha_i = 0$.

This  means  that  in  the  representation  of  the  solution, 
only  the  points  which  are  closest  to  the  hyperplane 
contribute;  in  fact  they  represent  the  hypothesis  it¬ 
self,  (and  their  number  can  also  be  used  to  give  an 
independent  bound  on  its  reliability  [3]). 

The resulting decision function can be written as:

$$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i^{0} K(x, x_i) - \theta \right)$$

where $\alpha^{0}$ is the solution of the constrained maximization
problem and $SV$ represents the (indexes of) support
vectors.

Such a scheme has proved to be very resistant to overfitting
in many classification problems [19, 6, 24]. However,
this scheme is nontrivial to implement and computationally
expensive. Furthermore, in some conditions,
it can suffer from numerical conditioning problems.

It  is  interesting  to  note  that  other  algorithms  which 
were  developed  with  different  motivations  have  been 
shown  to  use  a  similar  technique,  equivalent  to  map¬ 
ping  points  to  a  high  dimensional  feature  space  and 
separating  them  with  a  large  margin  hyperplane.  This 
is  the  case  for  Adaboost  [18],  and  for  Bayesian  Clas¬ 
sifiers  [7]  where  the  margin  distribution  over  all  the 


training  set  is  used  as  an  estimator,  rather  than  the 
margin. 

This  is  justified  by  a  theorem  from  Schapire  et  al. 
[18]  proving  that  the  fraction  of  training  points  which 
are  classified  with  large  margin  controls  the  predictive 
power,  and  that  valid  generalization  can  be  guaran¬ 
teed  even  when  few  points  lie  near  the  boundary  and 
hence  the  margin  of  the  sample  is  small. 

3 THE KERNEL-ADATRON (KA)
ALGORITHM 

In  the  Statistical  Mechanics  approach  to  learning  [25], 
a  very  similar  problem  has  been  studied,  with  different 
motivations.  The  “perceptron  with  optimal  stability” 
has  been  the  object  of  extensive  theoretical  and  exper¬ 
imental  work,  [15,  11,  2],  and  a  number  of  simple  iter¬ 
ative  procedures  have  been  proposed,  aimed  at  finding 
hyperplanes  which  have  “optimal  stability”  or  -  in  our 
terms  -  maximal  margin. 

One  of  them,  the  Adatron,  comes  with  theoretical 
guarantees  of  convergence  to  the  optimal  solution,  and 
of  a  rate  of  convergence  exponentially  fast  in  the  num¬ 
ber  of  iterations  [2, 15],  provided  that  a  solution  exists. 

We demonstrate that such models can be adapted,
with the introduction of kernels, to operate in a high-dimensional
feature space, and hence to learn nonlinear
decision boundaries. This provides a procedure
which emulates SV machines but doesn’t need to use
quadratic programming toolboxes.

In  this  section  we  will  briefly  sketch  the  Adatron  algo¬ 
rithm,  and  we  will  list  the  theoretical  results  which  can 
be  proved  for  it  (in  the  Statistical  Mechanics  frame¬ 
work),  pointing  to  the  relevant  papers  for  the  proofs 
of  the  theorems.  Finally  we  will  show  how  it  is  pos¬ 
sible  to  introduce  the  kernels.  The  next  section  will 
be  devoted  to  experimental  comparisons  between  KA 
and SV machines, and to benchmarking.

The Adatron is an on-line algorithm for learning
perceptrons which has an attractive fixed point corresponding
to the maximal-margin consistent hyperplane,
when this exists.

By writing the Adatron in the data-dependent representation,
and by substituting the dot products with
kernels, we obtain the following algorithm:

The Kernel-Adatron Algorithm.

1. Initialise $\alpha_i = 1$.

2. Starting from pattern $i = 1$, for labeled points
$(x_i, y_i)$ calculate $z_i = \sum_{j=1}^{p} \alpha_j y_j K(x_i, x_j)$.

3. For all patterns $i$ calculate $\gamma_i = y_i z_i$ and execute
steps 4 to 5 below.

4. Let $\delta\alpha_i = \eta(1 - \gamma_i)$ be the proposed change to the
multipliers $\alpha_i$.

5.1. If $(\alpha_i + \delta\alpha_i) \le 0$ then the proposed change to
the multipliers would result in a negative $\alpha_i$. Consequently,
to avoid this problem we set $\alpha_i = 0$.

5.2. If $(\alpha_i + \delta\alpha_i) > 0$ then the multipliers are updated
through the addition of the $\delta\alpha_i$, i.e. $\alpha_i \leftarrow \alpha_i + \delta\alpha_i$.

6. Calculate the bias $b$ from

$$b = \frac{1}{2} \left( \min_i (z_i^{+}) + \max_i (z_i^{-}) \right)$$

where $z_i^{+}$ are those patterns $i$ with class label $+1$ and
$z_i^{-}$ are those with class label $-1$.

7. If a maximum number of presentations of the pattern
set has been exceeded then stop, otherwise return
to step 2.
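
The steps above can be sketched in a few lines of Python. This is an illustrative reading of the algorithm, not the authors' MATLAB code. It assumes that the multipliers are updated one pattern at a time using the current values of all $\alpha_j$, that the bias is computed once after the last epoch (it does not enter the updates), and that the Gaussian kernel has the form $\exp(-\|x - x'\|^2 / (2\sigma^2))$. The XOR-style toy data and all function names are hypothetical.

```python
import math

def gaussian_kernel(x, z, sigma=1.0):
    # Assumed form: exp(-||x - z||^2 / (2 sigma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_adatron(xs, ys, kernel, eta=1.0, max_epochs=100):
    p = len(xs)
    K = [[kernel(xs[i], xs[j]) for j in range(p)] for i in range(p)]
    alpha = [1.0] * p                                    # step 1
    for _ in range(max_epochs):                          # step 7: fixed number of epochs
        for i in range(p):                               # steps 2-5, pattern by pattern
            z_i = sum(alpha[j] * ys[j] * K[i][j] for j in range(p))
            gamma_i = ys[i] * z_i
            delta = eta * (1.0 - gamma_i)                # step 4
            alpha[i] = max(0.0, alpha[i] + delta)        # steps 5.1 and 5.2
    z = [sum(alpha[j] * ys[j] * K[i][j] for j in range(p)) for i in range(p)]
    z_pos = [z[i] for i in range(p) if ys[i] == 1]
    z_neg = [z[i] for i in range(p) if ys[i] == -1]
    b = 0.5 * (min(z_pos) + max(z_neg))                  # step 6
    return alpha, b

def predict(x, xs, ys, alpha, b, kernel):
    # Only patterns with alpha_i > 0 contribute to the decision function.
    s = sum(a * y * kernel(x, xi) for a, y, xi in zip(alpha, ys, xs) if a > 0.0)
    return 1 if s - b >= 0 else -1

# XOR-like toy problem, not linearly separable in the input space.
xs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
ys = [-1, -1, 1, 1]

def kernel(u, v):
    return gaussian_kernel(u, v, sigma=0.5)

alpha, b = kernel_adatron(xs, ys, kernel, eta=1.0, max_epochs=100)
margins = [ys[i] * sum(alpha[j] * ys[j] * kernel(xs[i], xs[j]) for j in range(4))
           for i in range(4)]
```

On this symmetric toy problem every pattern ends up with margin $\gamma_i = 1$, so all four points are support vectors and the bias is zero.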


The kernel $K(x, x')$ can be any function satisfying
Mercer’s condition; in particular, it is possible to use
the RBF or polynomial kernels given in section 2.

Important  Remarks 

Using  results  reported  in  the  Statistical  Mechanics  lit¬ 
erature,  the  following  important  properties  of  the  Ada¬ 
tron  can  be  derived: 

1. (Anlauf and Biehl [2]) Every stable point for
the Adatron algorithm is a maximal margin point and
vice versa.

Proof Sketch

By inserting the Kuhn-Tucker conditions for the maximal
margin ($\alpha_i > 0 \Rightarrow \gamma_i = 1$, $\alpha_i = 0 \Rightarrow \gamma_i \ge 1$)
in the Adatron updating rule it follows that the optimal
margin is a fixed point. Vice versa, by imposing
$\delta\alpha_i = 0 \; \forall i$ the Kuhn-Tucker conditions are obtained.

2. (Anlauf and Biehl [2]) The algorithm converges
in a finite number of steps to a stable point if a solution
exists.

Proof Sketch The functional

$$L = \sum_{i=1}^{p} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

can be shown to be bounded from above, and to increase
monotonically at each updating step of the Adatron.
So it has to find a fixed point in a finite number of
steps.

3.  (Opper  [16],  [15])  The  rate  of  convergence  to 
the  optimal  solution  follows  an  exponential  law  in  the 
number  of  iterations. 

The  proof  makes  use  of  replica  calculations  from  Sta¬ 
tistical  Mechanics  (and  the  standard  assumptions  of 
that  model  [25]). 

Note: The convergence proof relies on an adequate
choice of $\eta$, which also controls the speed of
convergence itself. The issues regarding the choice of $\eta$
cannot be discussed here for lack of space, but we observe
that the theory provides an interval within which
a valid $\eta$ can be chosen. Results will be presented elsewhere.

4  EXPERIMENTAL  RESULTS 

We  have  evaluated  the  performance  of  the  KA  algo¬ 
rithm  with  gaussian  kernels  on  a  number  of  standard 
classification  datasets,  both  artificial  and  real.  The 
artificial datasets include the two-spirals problem [8],
n-parity [14], and mirror symmetry [14]. The real world
data  include  the  sonar  classification  problem  [9],  the 
Wisconsin  breast  cancer  dataset  [23]  and  a  database  of 
handwritten  digits  collected  by  the  US  Postal  Service 
[12]. 

4.1  THE  TWO  SPIRALS  PROBLEM  AND 
n-PARITY 

For the two spirals problem the task is to discriminate
between  two  sets  of  points  which  lie  in  two  spirals  in 
a  plane. 

The  solution  found  by  the  KA  algorithm  is  illustrated 
in  Figure  2  and  compared  with  the  solution  provided 
by  a  kernel-perceptron,  i.e.  a  generic  hyperplane  in 
the  feature  space  (Fig.  1). 

The diagrams present different decision functions; in
the kernel-perceptron’s case the small margin yields a
highly non-smooth boundary, while for the KA algorithm
a smooth and centered solution has been found.

Useful insight about the differences between these two
learning  machines  can  be  obtained  by  observing  the 
margin  distribution  graphs  in  Fig.  3,  which  present  the 
cumulative  distribution  of  the  margins  of  all  individual 
training  points,  i.e.  the  fraction  of  patterns  (vertical 
axis)  which  have  a  margin  larger  than  a  given  value 
8  (horizontal  axis).  It  is  interesting  to  note  that  the 
effect  on  the  margin  distribution  of  the  training  in  KA 
is  similar  to  the  one  in  Adaboost,  discussed  in  [18]. 
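
The margin distribution described here is simple to compute once the per-pattern margins are known. The following sketch shows one way to do it; the two lists of margins are made-up numbers chosen only to contrast a small-margin and a large-margin solution, not values from the experiments.

```python
def margin_distribution(margins, thresholds):
    """For each threshold, the fraction of training points whose margin exceeds it."""
    n = len(margins)
    return [sum(1 for m in margins if m > t) / n for t in thresholds]

# Hypothetical per-pattern margins y_i z_i for two classifiers on the same five points.
small_margin = [0.05, 0.10, 0.20, 0.90, 1.50]   # many points close to the boundary
large_margin = [1.00, 1.00, 1.00, 1.20, 1.50]   # every point at margin 1 or more

thresholds = [0.0, 0.5, 0.99]
dist_small = margin_distribution(small_margin, thresholds)
dist_large = margin_distribution(large_margin, thresholds)
```

Plotting the returned fractions against the thresholds gives a curve of the kind shown in Fig. 3: the curve of the large-margin classifier stays at 1 for longer before dropping.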

The  solution  for  the  n-parity  problem  [14],  which  is 
hard  to  separate  for  neural  networks,  was  found  in  1 
epoch  for  n  =  3  and  n  =  6,  while  it  took  respectively 
3  and  5  epochs  to  maximise  the  margin. 


Figure 1: Kernel-Perceptron (small margin) clearly
overfits

Figure 2: Kernel-Adatron learns a much smoother
boundary

Figure 3: Cumulative margin distribution for kernel-perceptron
and for KA. Note the scaling of the margins:
we denote with 100 the null margin (points on
the boundary).


4.2  MIRROR  SYMMETRY 


In the mirror symmetry problem [14] the output $y$ is
$+1$ if the input pattern $x$ (with components from $\{-1, +1\}$)
is exactly symmetrical about its centre; otherwise
the output is $-1$. For randomly constructed
input strings the output would be $-1$ with a high
probability. Consequently the labels $\pm 1$ are selected
with a 50% probability, the first half of the input
string is randomly constructed from components in
$\{-1, +1\}$ (both selected with a 50% probability), and the
second half of the string is symmetrical or random depending
on the target value given. Generalisation was
evaluated  using  a  test  set  drawn  from  the  same  distri¬ 
bution  (eliminating  any  instances  for  which  the  input 
string  is  identical  to  a  member  of  the  training  set). 
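
The data construction just described can be sketched as follows. The function names and the seed are illustrative, and one detail is an assumption: when the label is $-1$, a randomly drawn second half that happens to be symmetric is redrawn so that the label stays correct.

```python
import random

def is_symmetric(bits):
    return bits == bits[::-1]

def mirror_symmetry_example(n_bits, rng):
    """Draw (x, y): y is +1 or -1 with equal probability, x has components in {-1, +1}."""
    if n_bits % 2 != 0:
        raise ValueError("this sketch assumes an even number of components")
    y = rng.choice((-1, 1))
    first = [rng.choice((-1, 1)) for _ in range(n_bits // 2)]
    second = first[::-1]                      # symmetric completion, used when y = +1
    if y == -1:
        # Assumption: redraw until the second half is not the mirror image.
        while second == first[::-1]:
            second = [rng.choice((-1, 1)) for _ in range(n_bits // 2)]
    return first + second, y

def make_dataset(n_examples, n_bits, rng, exclude=()):
    """Draw examples, skipping any input string that appears in `exclude`."""
    seen = {tuple(x) for x, _ in exclude}
    data = []
    while len(data) < n_examples:
        x, y = mirror_symmetry_example(n_bits, rng)
        if tuple(x) not in seen:
            data.append((x, y))
    return data

rng = random.Random(0)
train = make_dataset(200, 30, rng)
test = make_dataset(1000, 30, rng, exclude=train)   # test inputs never repeat a training input
```

The test set here may contain repeated inputs, in line with the text; only overlaps with the training set are removed.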

In Figure 4 we plot the generalisation error on the
test set (100,000 examples including repetitions) versus
$\sigma$ for the KA algorithm trained to 200 epochs with
$\eta = 1.0$. 200 training examples were used with input
strings consisting of 30 components. The generalisation
error passes through a minimum between
$\sigma = 4$ and $\sigma = 5$, with a maximum generalisation of 95.1%.
To compare with other algorithms in a machine-independent
way we have implemented all algorithms in
MATLAB (using its optimization toolbox) and estimated
the individual speeds using FLOPS (Table 1).
We see that the KA algorithm is substantially faster
than Support Vector machines while also having a comparable
generalisation performance to the latter (TR
is the number of training errors, TS the number of test
errors on a set of 100 patterns). It also performs much
better than k-nearest neighbour (kNN) on the test set.

Table 1: comparison for mirror symmetry


4.3  SONAR  CLASSIFICATION 

The  sonar  classification  problem  of  Gorman  and  Se- 
jnowski  [9]  consists  of  208  instances  formed  by  60 
analogue  inputs,  representing  returns  from  a  roughly 
cylindrical  rock  or  a  metal  cylinder,  equally  divided 
into  training  and  test  sets.  For  the  aspect-angle  de¬ 
pendent  dataset  [9]  they  trained  a  standard  back- 
propagation  neural  network  with  60  inputs  and  2  out¬ 
put  nodes.  Experiments  were  performed  with  up  to 
24  hidden  nodes  and  each  neural  network  was  trained 
with  300  epochs  through  the  training  set.  Their  results 
are  reproduced  in  Table  2. 


Figure 4: Generalisation error (vertical axis) vs. $\sigma$
(horizontal axis): mirror symmetry problem

Table 2: Gorman and Sejnowski results for sonar


For the KA algorithm we plot $\sigma$ against generalisation
error in Figure 5 and the best generalisation performance
is 95.2% by comparison. The KA algorithm is
also very fast. Figure 6 illustrates the approach of the
margin towards 1 (for $\sigma = 1.0$ and $\eta = 1.0$). The training
error fell to 0 in the second epoch (it was 0.077 at
the end of the first epoch). We also show the generalisation
error versus number of epochs (Figure 7). As for
mirror symmetry we give a comparison with Support
Vector Machines in Table 3.

Table 3: comparison for sonar classification


4.4  WISCONSIN  BREAST  CANCER 
DATASET 

The  Wisconsin  breast  cancer  dataset  contains  699  pat¬ 
terns  with  10  attributes  for  a  binary  classification  task 
(the  tumour  is  malignant  or  benign). 

This  dataset  has  been  extensively  studied  by  other 
authors.  CART  gives  a  generalisation  of  94.2%,  an 
RBF  neural  network  gave  95.9%,  a  linear  discriminant 
method  gave  96.0%  and  a  multi-layered  neural  network 
(trained  via  Back-Propagation)  96.6%  (all  the  results 
have  been  obtained  using  10-fold  cross-validation  [23]). 
Our optimal test performance was 99.48%, which is
superior to the previously reported results. However, we
regard this result as simply indicating that we are comparable
with other approaches, as this difference can
also be due to other factors and requires further investigation.
Among them are differences in the handling
of instances with missing values (16 in the database),
in the preprocessing (we have removed the first column
of the database reporting the patient’s code number,
like some other authors) and in the choice of $\sigma$. We
note that the test error is insensitive to the choice of $\sigma$
in a broad interval, as can be seen in Fig. 11. In this
diagram we give a plot of generalisation error versus
$\sigma$ for 10-fold cross validation on the 699 instances (50
iterations were used and $\eta = 1.0$).

Figure 5: Generalisation error of KA (vertical axis) vs.
$\sigma$ (horizontal axis) for the sonar classification problem.


Figure 6: Margin (vertical axis) vs. number of epochs
(horizontal axis) for sonar classification ($\sigma = 1.0$, $\eta = 1.0$).


Figure 7: Generalization error (vertical axis) vs. number
of epochs (horizontal axis) for sonar classification
($\sigma = 1.0$, $\eta = 1.0$).


Furthermore, for a particular split of the database with
550 training examples and 149 test examples, $\sigma = 3.2$
and $\eta = 1.0$, we give plots of the generalisation error
(Fig. 8) and margin (Fig. 9), both versus number of epochs,
and the final spectrum of $\alpha$ values (Figure 10).

To compare the computational cost of KA with other classifiers we have used MATLAB implementations of them, run on a reduced subset of the database (199 training and 168 testing points), using the FLOPS count as an indication of the algorithmic complexity. The results are reported in Table 4 and indicate that KA can achieve about the same generalization performance as SV machines at a cost which is orders of magnitude smaller.
Table 4: Comparison for cancer classification (a subset has been used for this comparison).

4.5  US  POSTCODE  DATABASE 

The  benchmarking  of  classification  algorithms  of  the 
class  of  SV  machines  has  traditionally  been  performed 
using  the  database  of  handwritten  digits  from  US 
Postal  Codes  [12,  20]. 

This dataset consists of a training set of 7,291 examples and a test set of 2,007. Each digit is given by a 16 × 16 vector with components which lie in the range -1 to 1. In this experiment we have performed two-class classification, i.e. separating a particular digit from the others. To find suitable values for $\sigma$, the training set was split into a smaller training set of 6,000 examples and a validation set of 1,291. The best value of $\sigma$ was found by evaluating performance on the validation set across the range (1, 10). The full training set of 7,291 was then used with the selected value of $\sigma$ to train the system to classify each digit.
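
The validation procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `select_by_holdout`, `fit_kernel_vote` and `count_errors` are hypothetical names, and the toy Gaussian-kernel vote classifier is a stand-in, not the KA algorithm used in the experiments.

```python
import math
import random


def select_by_holdout(examples, candidates, fit, count_errors, n_val, seed=0):
    """Pick the candidate parameter with the fewest validation errors,
    then refit on the full training set with that parameter."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    val, train = shuffled[:n_val], shuffled[n_val:]
    best = min(candidates, key=lambda c: count_errors(fit(train, c), val))
    return best, fit(examples, best)


def fit_kernel_vote(train, sigma):
    """Toy stand-in classifier (an assumption of this sketch): the sign of a
    Gaussian-kernel-weighted vote over the training labels."""
    def predict(x):
        vote = sum(y * math.exp(-(x - xi) ** 2 / (2.0 * sigma ** 2))
                   for xi, y in train)
        return 1 if vote >= 0 else -1
    return predict


def count_errors(predict, data):
    """Number of misclassified (input, label) pairs."""
    return sum(1 for x, y in data if predict(x) != y)
```

With 7,291 examples and `n_val=1291` the split sizes match those quoted in the text; the candidate list would then cover the range of $\sigma$ values being searched.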

The results are shown in Table 5, where the last column shows the best value of $\sigma$ found from the validation study. The other columns show the number of errors on the test set of 2,007 examples for the KA algorithm and three comparative algorithms as reported by [20]. The latter three algorithms are an RBF neural network, a Support Vector Machine (SVM) and a hybrid model in which the support vectors found by the SVM are used as the centers of receptive fields in an RBF network [20].

Figure 8: Generalization error (vertical axis) vs. number of epochs (horizontal axis) for cancer classification.

Figure 9: Margin (vertical axis) vs. number of epochs (horizontal axis) for cancer classification.

Figure 10: Spectrum of $\alpha$ values for the 550 patterns found by the KA algorithm (cancer classification experiment).

Figure 11: Generalisation error (vertical axis) vs. $\sigma$ (horizontal axis) for cancer classification ($\eta = 1.0$).
Table 5: Comparative performance on the USPS database (number of errors on the 2,007-point test set).

The  performance  of  KA  is  comparable  with  the  other 
algorithms. 

5  CONCLUSIONS  AND  FUTURE 
WORK 

We have presented an algorithm which finds maximal margin hyperplanes in a high dimensional feature space, emulating Vapnik’s Support Vector machines.

Experiments performed on artificial and real data show that the generalization performance of this algorithm is comparable with that of SV machines, while the computational cost of finding the hypothesis is significantly smaller. Also, the introduction of kernels into the Adatron provides a very simple, compact and robust algorithm.

Further work is now needed to introduce the capability of tolerating training errors, so that the machine can deal with outliers and noisy datasets, following the soft-margin approach used in SV machines.
Abstract


We consider multi-criteria sequential decision making problems where the vector-valued evaluations are compared by a fixed total ordering of the vectors. Conditions for the optimality of stationary policies and the Bellman optimality equation are given for a special, but important class of problems, when the evaluation of policies can be computed componentwise. The analysis requires special care as the topology introduced by pointwise convergence and the order-topology introduced by the preference order are in general incompatible. Reinforcement learning algorithms are then proposed and analyzed. Preliminary computer experiments confirm the validity of the derived algorithms. This type of multi-criteria problem is most useful when there are several optimal solutions to a problem and one wants to choose the one among these which is optimal according to another fixed criterion. Possible applications in robotics and repeated games are outlined.


1  Introduction 

Scalar-valued reinforcement learning (RL) algorithms are capable of solving difficult multi-step decision problems when the decision criteria can be expressed in a recursive way as a function of the immediate scalar reinforcement. However, there are some important cases when there is no simple way to express the optimization criteria as a function of a single scalar reinforcement value. Consider, for example, the dilemma of Buridan’s ass.^1 This poor animal is placed at equal distances from two platefuls of food. He is hungry, so he feels like going to one of the plates. However, if he goes to one plate then there is a chance that the dish from the other one gets stolen. Since the ass is greedy (he does not want any dish to be stolen) he will never move and will, eventually, die.

In this example the ass has two different objectives competing with one another. The first one is to eat so that he can stay alive, the second one is to prevent the dishes from being stolen. A reasonable compromise, which could be termed the “watchmen’s compromise”, is to minimize the number of dishes stolen per unit time such that the ass manages to stay alive:

$$\lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T} S_t \to \min \quad \text{s.t.} \quad \lim_{T\to\infty} \frac{1}{T}\sum_{t=0}^{T} R_t \ge R_{\mathrm{crit}}.$$

Here $S_t \in \{0,1\}$ is the indicator of whether a plate was stolen at time $t$, $R_t \in \{0,1\}$ is the indicator of whether the ass was consuming at time $t$, and $R_{\mathrm{crit}}$ is the critical amount of food per unit time needed for staying alive. We can use a Tauberian approximation to the above criterion [Ross, 1970]:

$$\sum_{t=0}^{\infty} \gamma^t S_t \to \min \quad \text{s.t.} \quad \sum_{t=0}^{\infty} \gamma^t R_t \ge R'_{\mathrm{crit}}, \tag{1}$$

where $0 < \gamma < 1$ is a value sufficiently close to 1,

^1 Buridan, a French philosopher of the mediaeval period, wrote several significant commentaries on the classical philosophical, logical, and physical works of Aristotle, including the Physics. Actually, he never referred to the infamous ass in his extant writings; the concept was invented by his opponents to ridicule his use of animals in the examples he used to expound his theories on free will. In the original version of the story a hungry ass stood between two haystacks, both of which were equally appetizing. Unable to decide from which stack to eat, the ass eventually starved to death. However, the example in this form did not serve our purposes well, so we felt free to modify it slightly.
$R'_{\mathrm{crit}} = R_{\mathrm{crit}}/(1-\gamma)$.^2 Since the decision should be made on the basis of both the amount of food eaten and the number of plates stolen, and both of these should be computed separately, the normal form of reinforcement at time $t$ will be $(R_t, S_t)$.
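
The relation between the average criterion and its discounted approximation can be checked numerically. The following Python sketch is an illustration only: the eating pattern (one meal every four steps) and the function names are assumptions made for the example. It compares the long-run average of a 0/1 reward stream with the discounted sum multiplied by $1-\gamma$, which is the normalisation implied by $R'_{\mathrm{crit}} = R_{\mathrm{crit}}/(1-\gamma)$.

```python
def average_reward(rewards):
    """Empirical average of a finite reward stream."""
    return sum(rewards) / len(rewards)


def normalised_discounted_reward(rewards, gamma):
    """(1 - gamma) * sum_t gamma^t R_t, comparable in scale to the average."""
    return (1.0 - gamma) * sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical stream: the ass eats once every four steps.
rewards = [1 if t % 4 == 0 else 0 for t in range(20000)]

for gamma in (0.9, 0.99, 0.999):
    gap = abs(normalised_discounted_reward(rewards, gamma) - average_reward(rewards))
    print(gamma, gap)
```

For this stream the gap shrinks as $\gamma$ approaches 1, which is the sense in which a discount factor "sufficiently close to 1" reproduces the average criterion.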

Note that there exist other ways to formalize the dilemma of Buridan’s ass. Another reasonable compromise, e.g., is to maximize the weighted sum of protected plates and the amount eaten: $\sum_{t=0}^{\infty} \gamma^t \left(w_1 (1 - S_t) + w_2 R_t\right) \to \max$, where $w_1, w_2 > 0$. This reduces the problem to the case of scalar-valued reinforcement values. Here, we do not want to argue against this or other reductions, but we want to show that under certain conditions reinforcement learning algorithms can be extended to the vector-valued case in a sensible way.

If the immediate reinforcement is vector-valued then so will be the long-term reinforcement, and, specifically, the evaluation of policies. Then the comparison of policies becomes problematic. The requirements for a meaningful comparison are the following: we want to be able to compare any pair of policies and, in particular, we want a transitive and reflexive comparison operator. Several approaches will be shown below. No matter how the policies are compared, the notion of an optimal policy can be defined at this point: an optimal policy is one which compares favorably with any other policy.

The comparison methods are best illustrated by the above problem. Let $v_\pi(x) \in \mathbb{R}^2$ denote the evaluation of policy $\pi$ in state $x$ with $v_\pi(x)^T = (v_{\pi,1}(x), v_{\pi,2}(x))$, where $v_{\pi,1}(x)$ is the minimum of the amount of food eaten and $R_{\mathrm{crit}}$, while $v_{\pi,2}(x)$ is the number of plates stolen, both being computed when policy $\pi$ is used beginning from state $x$. The criterion of the introduction suggests comparing any pair of policies $(\pi_1, \pi_2)$ by first comparing the first components of their respective evaluation functions: $\pi_1$ is better than $\pi_2$ if $v_{\pi_1,1}(x) > v_{\pi_2,1}(x)$. Since evaluations are cut at $R_{\mathrm{crit}}$ we may expect that $v_{\pi_1,1}(x)$ and $v_{\pi_2,1}(x)$ will be equal in a large number of cases. Then, we compare the second components: $\pi_1$ is better than $\pi_2$ if $v_{\pi_1,2}(x) < v_{\pi_2,2}(x)$ (note the reversed relational symbol). That is, among policies which let Buridan’s ass stay alive, the ones with a smaller number of stolen plates are preferred. Since here the policies are compared on the basis of an ordering among the vector-components of the policy evaluation functions, this problem is one example of ordinal multi-criteria decision problems, which were considered a long time ago by Mitten [1964] and Sobel [1975] in terms of preference relations over “partial policies”. In order for the subordinate criteria to be useful at all, the optimization problem corresponding to the main objective should have multiple solutions. This can be achieved using reduced reinforcement-spaces. As an interesting example note that Asimov’s robots obey multi-criteria rules of this form. The “laws of robotics” state that robots have to i) defend human beings; ii) defend themselves unless this conflicts with rule i); and iii) serve human beings unless this conflicts with rules i) or ii). This can be clearly understood as an ordinal multi-criteria optimization problem. This type of criterion is also related to solving MDPs in parallel, a problem similar to that considered by Singh and Cohn [1997] and, empirically, by Asada et al. [1994] for football playing robots. In this latter case a robot’s primary goal could be to win the game, while its subordinate goal could be to keep clear of opponents as much as possible.

^2 In order to simplify the presentation we implicitly assume here that the decision process is deterministic. However, this assumption is in no way essential to the subsequent developments and will be abandoned later.

Criterion (1) can also be viewed as one that defines a discounted optimization problem subject to a discounted constraint. Structural properties of such problems were studied extensively in the control and operations research literature, e.g. by Frid [1972], Heyman and Sobel [1984], and Altman and Schwartz [1991].

Another approach is to compare any pair of policies, $(\pi_1, \pi_2)$, by comparing the weighted sums of the components of their evaluation functions, e.g. $w_1 v_{\pi_1,1}(x) + w_2 v_{\pi_1,2}(x)$ and $w_1 v_{\pi_2,1}(x) + w_2 v_{\pi_2,2}(x)$ ($w_1, w_2 \in \mathbb{R}$).

Note that this criterion, often called the weighted criterion (see Feinberg and Schwartz [1995] and the references therein), is different from the one obtained by the linear combination of the immediate reinforcement values iff the discount factors of the two components are different.

If there is no natural weighting of components then one can still use the canonical ordering over the return space. In this case, however, not all policies will be comparable and so the notion of optimality needs to be adjusted. The natural choice is then Pareto-optimality: a policy $\pi$ is called Pareto-optimal in state $x$ if no other policy can majorize it at $x$, i.e., if there is no policy $\pi'$ s.t. $v_{\pi'}(x) > v_\pi(x)$. A policy is called Pareto-optimal iff it is Pareto-optimal for each state. It turns out that Pareto-optimality is equivalent to weighted optimality with appropriately chosen weights, provided that each component of the evaluation is computed as the total discounted reward for some reward function [Feinberg and Schwartz, 1995, Lemma 7.4]. In the above example, assuming that the amount of consumed food is not truncated, a Pareto-optimal policy would be one for which there is no other policy that would allow the ass to consume more (than the amount ensured by the Pareto-optimal policy) while assuring a smaller number of stolen plates at the same time. Pareto-optimality has been studied by many researchers from the point of view of providing conditions which ensure the existence of optimal policies of certain forms (stationary policies are not Pareto-optimal in general).
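
As an illustration of Pareto-optimality in the Buridan example, the following Python sketch filters a finite list of evaluations $(\text{food}, \text{stolen})$ at a fixed state. It assumes the common convention that one evaluation dominates another when it is at least as good in both components and strictly better in at least one; the function names and the sample numbers are hypothetical.

```python
def dominates(u, v):
    """True if evaluation u = (food, stolen) dominates v:
    no less food, no more stolen plates, and strictly better in one of them."""
    no_worse = u[0] >= v[0] and u[1] <= v[1]
    strictly_better = u[0] > v[0] or u[1] < v[1]
    return no_worse and strictly_better


def pareto_front(evaluations):
    """Evaluations that are not dominated by any other evaluation in the list."""
    return [u for u in evaluations
            if not any(dominates(v, u) for v in evaluations)]


# Hypothetical evaluations of four policies at one state.
candidates = [(3, 2), (2, 1), (2, 3), (1, 1)]
front = pareto_front(candidates)
```

Here `(2, 3)` is dominated by `(3, 2)` and `(1, 1)` by `(2, 1)`, while `(3, 2)` and `(2, 1)` are incomparable, which is why a Pareto front generally contains several policies.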

Apparently the earliest results for dynamic vector-valued models are those of Brown and Strauch [1965], who considered abstract return spaces having a general multiplicative lattice structure and who showed that the “principle of optimality” holds for finite-horizon problems. Their results were later extended to infinite horizon problems in many special cases (see, e.g. [Feinberg, 1982, Henig, 1983, Feinberg and Schwartz, 1994]).

In this article we present a general framework which is based on abstract dynamic programming models and which is a mixture of the above approaches [Denardo, 1967, Bertsekas, 1977, Littman and Szepesvári, 1996, Szepesvári, 1998]. Namely, we suggest an approach based on the notion of reinforcement-propagating operators, which now act on function spaces defined over an abstract return space with a given ordering. In this way we can address constrained problems, lexicographic criteria, lattice return spaces and different reinforcement propagation scenarios within the same framework.

The article is organized as follows: in Section 2 we introduce the concepts necessary for the development and list some basic results concerning the Bellman-optimality equation and the existence of optimal stationary policies. Reinforcement learning algorithms are introduced in Section 3. Some computer experiments, illustrating the theory, are given in Section 4 and conclusions are drawn in Section 5.

2  Abstract  ordinal  dynamic 
programming 

An Abstract Dynamic Programming (ADP) problem can be given as a 5-tuple $(\mathcal{R}, X, A, \mathcal{A}, Q)$, where $X$ is the state-space of the decision problem, $A$ is the set of actions, $\mathcal{A}$ assigns to each state $x \in X$ the set $\mathcal{A}(x) \subseteq A$ of actions feasible in state $x$, $\mathcal{R}$ is the return space and $Q : \mathcal{R}^X \to \mathcal{R}^{X \times A}$ is the so-called reinforcement-propagator operator [Szepesvári, 1998].^3 In order to explain the

^3 Here $A^B$ denotes the set of functions mapping $B$ into $A$.


meaning of these components consider the problem of Buridan’s ass once again. A simplified representation of that problem could be the following: the ass’s position assumes three values: being in the middle, at the left plate, or at the right plate. The plates can be empty or full. A state of the decision problem is composed of the position of the ass and the states of the plates, so the state space ($X$) has 12 elements. The actions taken by the ass can be to stay at that position, move to the left, or move to the right; so the action space ($A$) has three elements. The dynamics are given by the following (stochastic) rules: the move actions work as intended. If the ass chooses to stay at a full plate then that plate becomes empty (consuming); if the ass stays at an empty plate then food may appear at that plate according to some fixed stochastic rule; and if the ass stays at a plate (either full or empty) then the state of the other plate can change according to some other fixed (stochastic) rule. If the ass is in the middle then none of the plates can become empty in the next step (the ass is guarding the food). The dynamics can be summarized by a random mapping $t : X \times A \to X$ (or, equivalently, as a set of transition probabilities). The ass is considered to be consuming a unit of food if he chooses to stay at a full plate. If $x_t$ is the state at time $t$ then the reinforcement streams $R_t$ and $S_t$ of Eq. (1) can be given as follows: $R_t = 1$ if in state $x_t$ the ass is at a full plate and the chosen action, $a_t$, is “stay”; $R_t = 0$ otherwise. Therefore, $R_t = R(x_t, a_t)$ for some function $R$. Further, $S_t = 1$ if the food disappears from a plate while the ass is at the other plate; otherwise $S_t = 0$. That is, $S_t = S(x_t, a_t, x_{t+1})$, where $x_{t+1} = t(x_t, a_t)$. Let us define the evaluation of a (deterministic, stationary) policy, $\pi : X \to A$, by

$$v_{\pi,1}(x) = \min\Big(R_{\mathrm{crit}},\; E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, x_0 = x\Big]\Big),$$

$$v_{\pi,2}(x) = E\Big[\sum_{t=0}^{\infty} \gamma^t S_t \,\Big|\, x_0 = x\Big],$$

where $E[\cdot]$ is the expectation operator underlying the decision process. By standard arguments, and since $\min(R, E[\xi + \eta]) = \min(R, E[\xi] + E[\eta]) = \min(R, E[\xi] + \min(R, E[\eta]))$ holds if $R > 0$ and $\xi, \eta$ are nonnegative random variables, one can show that $v_\pi$ can be written recursively:

$$v_{\pi,1}(x) = \min\Big(R_{\mathrm{crit}},\; R(x,\pi(x)) + \min\Big(R_{\mathrm{crit}},\; \gamma \sum_{y \in X} p(x,\pi(x),y)\, v_{\pi,1}(y)\Big)\Big),$$

$$v_{\pi,2}(x) = \sum_{y \in X} p(x,\pi(x),y)\,\big\{S(x,\pi(x),y) + \gamma\, v_{\pi,2}(y)\big\}. \tag{2}$$

Here $p(x,a,y) = P(y = t(x,a))$. Similar recursions hold for non-deterministic, Markovian, and even for non-Markovian policies [Szepesvári, 1998]. Now, if one defines $Q$ by

$$(Qv)(x,a)_1 = \min\Big(R_{\mathrm{crit}},\; R(x,a) + \min\Big(R_{\mathrm{crit}},\; \gamma \sum_{y \in X} p(x,a,y)\, v_1(y)\Big)\Big),$$

$$(Qv)(x,a)_2 = \sum_{y \in X} p(x,a,y)\,\big\{S(x,a,y) + \gamma\, v_2(y)\big\}$$

and $T_\pi : \mathcal{R}^X \to \mathcal{R}^X$ by $(T_\pi v)(x) = (Qv)(x, \pi(x))$, $x \in X$, then we see that $v_\pi$ becomes the fixed point of $T_\pi$. Note that the definition of $Q$ is obtained from (2) by systematically replacing $\pi(x)$ by $a$, and $v_\pi$ by $v$, meaning that $Q$ provides a concise summary of both the state- and reinforcement-dynamics of the decision process in an abstract form.
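
The operator $Q$ and the induced policy operator $T_\pi$ can be written down directly for a finite problem. The Python sketch below is a minimal illustration, assuming that the transition probabilities `p`, the reward function `R` and the stealing indicator `S` are supplied as plain functions; the names `propagate` and `evaluate_policy` are hypothetical, and plain fixed-point iteration of $T_\pi$ is used here as one possible way to approximate $v_\pi$.

```python
def propagate(v, states, actions, p, R, S, gamma, r_crit):
    """Compute (Qv)(x, a) for all pairs, as a dict (x, a) -> (food, stolen).

    v maps each state to a pair (v_1, v_2); the first component is the
    truncated food criterion, the second the discounted stolen-plate count."""
    out = {}
    for x in states:
        for a in actions:
            expected_v1 = sum(p(x, a, y) * v[y][0] for y in states)
            food = min(r_crit, R(x, a) + min(r_crit, gamma * expected_v1))
            stolen = sum(p(x, a, y) * (S(x, a, y) + gamma * v[y][1])
                         for y in states)
            out[(x, a)] = (food, stolen)
    return out


def evaluate_policy(policy, states, actions, p, R, S, gamma, r_crit, sweeps=500):
    """Iterate v <- T_pi v with (T_pi v)(x) = (Qv)(x, pi(x)), starting from zero."""
    v = {x: (0.0, 0.0) for x in states}
    for _ in range(sweeps):
        q = propagate(v, states, actions, p, R, S, gamma, r_crit)
        v = {x: q[(x, policy[x])] for x in states}
    return v
```

In a one-state toy problem where the ass always eats and a plate is always stolen, with $\gamma = 0.5$ and $R_{\mathrm{crit}} = 1.5$, the untruncated food value would be 2, so the first component settles at the truncation level 1.5 while the second converges to 2.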

Policies are compared on the basis of their evaluations. Since now $v_\pi(x) \in \mathcal{R} = \mathbb{R}^2$ is vector-valued we need a way to compare pairs of vectors. Therefore, we will assume that a binary relation $\le$ over $\mathcal{R}$ is given which is reflexive, transitive and trichotomous (i.e., $\le$ is an ordering, or $\mathcal{R} = (\mathcal{R}; \le)$ is a lattice).^4 Buridan’s ass requires a “reverse-2nd” lexicographic ordering: $r \le r'$ if $r_1 < r'_1$ or if $r_1 = r'_1$ and $r_2 \ge r'_2$ (here the components of $r$ and $r'$ are denoted by lower indices). This finishes the construction of the ADP describing the problem-structure of Buridan’s ass. This “reverse-2nd” lexicographic ordering differs from lexicographic ordering only by the condition on the second components: we wrote $r_2 \ge r'_2$ instead of $r_2 \le r'_2$. For convenience, we will continue with considering lexicographic ordering. Lexicographic ordering (and also “reverse-2nd” ordering) satisfies the above properties, i.e., it is an ordering.
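
The “reverse-2nd” ordering is easy to state as a comparison function. The following Python sketch is illustrative only (the names `reverse_2nd_cmp` and `best` are hypothetical): it compares two return vectors $(r_1, r_2)$, preferring a larger first component and, on ties, a smaller second component.

```python
from functools import cmp_to_key


def reverse_2nd_cmp(r, s):
    """Negative if r < s, zero if r == s, positive if r > s
    in the reverse-2nd lexicographic ordering."""
    if r[0] != s[0]:
        return -1 if r[0] < s[0] else 1
    if r[1] != s[1]:
        # Second component is reversed: more stolen plates is worse.
        return -1 if r[1] > s[1] else 1
    return 0


def best(evaluations):
    """Maximum element of a finite list under the reverse-2nd ordering."""
    return max(evaluations, key=cmp_to_key(reverse_2nd_cmp))
```

For two-component numeric vectors the same maximum is obtained with the key `lambda r: (r[0], -r[1])`, since Python compares tuples lexicographically.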

^4 A binary relation $\le$ over $\mathcal{R}$ is called i) reflexive if $r \le r$ for any $r \in \mathcal{R}$; ii) transitive if, whenever $r, r', r'' \in \mathcal{R}$ are such that $r \le r'$ and $r' \le r''$, then $r \le r''$; and iii) trichotomous if for any pair $r, r' \in \mathcal{R}$ either $r \le r'$ or $r' \le r$ (the ordering is total) and if both relations hold then $r = r'$.

In order to facilitate the connection with RL we will define the notion of optimal value function (instead of relying on Pareto-optimality), but first we need to assign a meaning to the supremum of subsets of $\mathcal{R}$: for $A \subset \mathcal{R}$, $a = \mathrm{s.u.p.}\, A$ is a value such that $a \ge A$ and, for all $c \ge A$, also $c \ge a$ ($a \ge b$ is defined by $b \le a$, and $a \ge A$ is defined as $a \ge a'$ for all $a' \in A$). The infimum of sets is defined analogously. The maximum of a set $A$ is defined by

$$a = \mathrm{m.a.x.}\, A \quad \text{iff} \quad a \in A \ \text{and}\ a \ge b,\ \forall b \in A, \tag{3}$$

the minimum could be defined analogously. A lattice $(\mathcal{R}; \le)$ is said to be complete if for all bounded subsets $A$, both the infimum and the supremum of the set exist. Lexicographic ordering can be made complete if the set of reals $\mathbb{R}$ is replaced by the set of extended reals, $\bar{\mathbb{R}} = \{-\infty, +\infty\} \cup \mathbb{R}$, which is understood with the natural topology. Then if the return space is $\mathcal{R} = \bar{\mathbb{R}}^2$, the supremum $a^*$ of a set $A \subset \bar{\mathbb{R}}^2$ can be defined in the standard way as follows: $a^*_1 = \sup\{a_1 : a = (a_1, a_2)^T \in A\}$ and $a^*_2 = \inf\{a_{2n} : a_n = (a_{1n}, a_{2n})^T \in A \ \text{s.t.}\ a_{1n} \to a^*_1\}$.

The ordering $\le$ of $\mathcal{R}$ is extended to functions assuming values in $\mathcal{R}$ in the usual way: for $v, w \in \mathcal{R}^Y$ we say that $v \le w$ iff for all $y \in Y$, $v(y) \le w(y)$ holds. Note that the induced ordering, $\le$, is only a partial ordering over $\mathcal{R}^Y$ (i.e., it is not total).

Equipped with the notion of supremum we can define the optimal reinforcement function:

$$v^*(x) = \operatorname*{s.u.p.}_{\pi \in \Pi} v_\pi(x), \quad x \in X. \tag{4}$$

Here $\Pi$ denotes a fixed set of policies. We will consider the case when $\Pi$ equals the set of all stationary policies. A policy $\pi$ in the class $\Pi$ is said to be optimal if $v_\pi = v^*$.

Now, we can answer the question about the form of optimal stationary policies in the case of Buridan’s ass. For sure, an “optimal ass” would indefinitely repeat “guarding steps” (staying in the middle) and “consumption steps”. It should also be clear that the exact ratio of the waiting periods would depend on the value of $R_{\mathrm{crit}}$, and that for some values of $R_{\mathrm{crit}}$ all stationary policies would be suboptimal. A form of optimal policies for this class of problems can be found in [Feinberg and Schwartz, 1995]. Note that if one extends the state space, so that the ass has counting-actions with a limited set of numbers (i.e. if the ass is enabled to count up to a fixed maximum number of steps), and if the ass can choose actions randomly, then optimal policies w.r.t. the full set of policies can be recovered exactly. So this case reduces to the case of randomized stationary policies. The following theorem restricts the set of policies further to deterministic stationary policies, so that tractability of the learning problem will be ensured, but global optimality may be lost. The theorem is proven in the appendix.

Theorem 2.1 Consider a finite ADP, $(\mathcal{R}, X, A, \mathcal{A}, Q)$,^5 where (i) $(\mathcal{R}; +, \lambda\cdot, \|\cdot\|_{\mathcal{R}})$ is a Banach space and $\mathcal{R}$ is equipped with (ii) a complete ordering $\le$ which satisfies the following countable transitivity property: (iii) if $r_n$ is weakly convergent^6 in $\mathcal{R}$, and $r_0 \le r_1 \le r_2 \le \ldots \le r_n \le r_{n+1} \le \ldots$ then $r_0 \le \lim_{n\to\infty} r_n$. Further, assume that (iv) $Q : \mathcal{R}^X \to \mathcal{R}^{X \times A}$ is monotone: $Qv \le Qw$ whenever $v \le w$, $v, w \in \mathcal{R}^X$, and continuous in the topologies induced by pointwise convergence over $\mathcal{R}^X$ and $\mathcal{R}^{X \times A}$, (v) and that $Q$ is a contraction w.r.t. the induced max-norm^7 $\|\cdot\|_{\infty,\mathcal{R}}$. (vi) Assume that $T : \mathcal{R}^X \to \mathcal{R}^X$, defined by

$$(Tv)(x) = \operatorname*{m.a.x.}_{a \in \mathcal{A}(x)} (Qv)(x,a) \tag{5}$$

has a unique fixed point $v^+$, and $\lim_{n\to\infty} T^n v = v^+$ for all $v \in \mathcal{R}^X$ s.t. $\|v\|_{\infty,\mathcal{R}} < \infty$. Let $\Pi = A^X$ be the space of stationary policies. Then (a) $v^+ \ge v_\pi$ for all $\pi$ ($\pi$ is a deterministic stationary policy) and $v^+ = v^*$, so $Tv^* = v^*$ (Bellman optimality equation); (b) if $T_\pi v^+ = T v^+$, i.e., if $\pi$ is myopic w.r.t. $v^+$, then $v_\pi = v^*$ (myopic policies are optimal); (c) if $T_{\pi'} v_\pi \ge v_\pi$ then $v_{\pi'} \ge v_\pi$ (Howard’s policy improvement routine is valid).

Operator $T$, as defined by (5), is called the optimal value operator.

It is easy to check that countable transitivity holds for sequences of $\mathbb{R}^n$ and the lexicographic ordering. Note that contraction arguments cannot be used since there is no norm over $\mathbb{R}^n$ with the lexicographic ordering for which the m.a.x. operator would be a non-expansion. For a further discussion of this and additional peculiarities related to lexicographic orderings see [Gábor et al., 1998].

Note that if $\mathcal{R} = \mathbb{R}^n$ with the lexicographic ordering then the actions at which the maximum is reached in


^5 An ADP $(\mathcal{R}, X, A, \mathcal{A}, Q)$ is called finite if both $X$ and $A$ are finite. The finiteness assumption could be relaxed by some extra work.

^6 A sequence $r_n$ is said to be weakly convergent in $\mathcal{R}$ if it is convergent in the topology induced by the vector space structure of $\mathcal{R}$.

^7 The induced maximum-norm $\|\cdot\|_{\infty,\mathcal{R}}$ is defined by $\|v\|_{\infty,\mathcal{R}} = \sup_{x \in X} \|v(x)\|_{\mathcal{R}}$.


Eq. (5) can be computed by first computing the sets

$$A_{i+1}(x) = \Big\{\, a \in A_i(x) \;\Big|\; \max_{b \in A_i(x)} (Qf)(x,b)_{i+1} = (Qf)(x,a)_{i+1} \Big\} \tag{6}$$

recursively for $i = 0, 1, 2, \ldots, n-1$, with $A_0(x) = \mathcal{A}(x)$. For convenience, we will denote the action sets as defined above by $A_i(Q, x)$ when $Qf$ is replaced by any function $Q \in \mathcal{R}^{X \times A}$:

$$A_0(Q,x) = \mathcal{A}(x),$$

$$A_{i+1}(Q,x) = \Big\{\, a \in A_i(Q,x) \;\Big|\; \max_{b \in A_i(Q,x)} Q(x,b)_{i+1} = Q(x,a)_{i+1} \Big\},$$

where $i = 0, 1, 2, \ldots, n-1$. Then $(Tv)(x)_{i+1} = \max_{a \in A_i(Qv,x)} (Qv)(x,a)_{i+1}$. Now we show that $T$ has a unique fixed point and $T^n v$ converges to this fixed point for all bounded $v \in \mathcal{R}^X$ provided that $Q$ satisfies the conditions of the above theorem and if $Q$ acts componentwise, i.e., if $(Qv)_i = (Qw)_i$ whenever $v_i = w_i$.
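
The successive filtering of action sets can be sketched in Python as follows. This is an illustration under stated assumptions: `lexicographic_argmax` is a hypothetical name, the action values are given as a dict from feasible actions to tuples, and the optional tolerance `tol` is an addition of this sketch for dealing with floating-point ties (with `tol=0.0` the filtering is exact).

```python
def lexicographic_argmax(q_row, tol=0.0):
    """Return the actions that survive component-by-component filtering.

    q_row maps each feasible action to its tuple of component values and
    must be non-empty. The list of survivors starts as the full feasible set
    and is shrunk once per component, keeping only the actions that maximise
    that component among the current survivors."""
    survivors = list(q_row)
    n_components = len(next(iter(q_row.values())))
    for i in range(n_components):
        top = max(q_row[a][i] for a in survivors)
        survivors = [a for a in survivors if q_row[a][i] >= top - tol]
    return survivors


# Hypothetical action values at one state: (first criterion, second criterion).
q_row = {"stay": (1.0, -2.0), "left": (1.0, -1.0), "right": (0.5, 0.0)}
winners = lexicographic_argmax(q_row)
```

Here "stay" and "left" tie on the first component, and "left" wins on the second; with exact comparisons this agrees with taking the maximum of the tuples in Python's built-in lexicographic tuple order.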

Theorem 2.2 Assume that $Q$ acts componentwise and that conditions (i)-(v) of Theorem 2.1 are satisfied. Then condition (vi) is also satisfied and thus the conclusions of Theorem 2.1 hold.

Proof. Fix $v$ and consider the first component of $T^n v$. Define $T_1 : \mathbb{R}^X \to \mathbb{R}^X$ by $T_1 f = (T\tilde{f})_1$, where $\tilde{f} = (f, f_2, \ldots, f_n)$ with $f_2, \ldots, f_n$ being arbitrary. $T_1$ is well defined and is a contraction. Moreover, $(T^n v)_1 = T_1^n v_1$ holds for all $n \in \mathbb{N}$, and therefore $(T^n v)_1$ converges to the unique fixed point of $T_1$. Similarly, if $u$ and $w$ are both fixed points of $T$ then $u_1 = w_1$. Let us denote this common value by $v_1^+$. Now, consider $(T^n v)_2$. Since $(T^{n+1} v)_2(x) = \max_{a \in A_1(QT^n v, x)} (QT^n v)(x,a)_2$, and since $Q$ is componentwise, $A_1(QT^n v, x)$ depends only on $(T^n v)_1$ which is known to converge. Therefore, because of the finiteness of $A$, for $n$ large enough $A_1(QT^n v, x)$ will stabilize at some set $A_1^*(v, x)$. Now, since the operator $u \mapsto \max_{a \in A_1^*(v,x)} (Q\tilde{u})(x,a)_2$ is a contraction, where $\tilde{u} = (v_1^+, u, u', \ldots)$, also $(T^n v)_2$ will converge to some value (the operator is well defined since $Q$ is componentwise). Moreover, if $u$ and $w$ are both fixed points of $T$ then $u_1 = w_1$ and thus $A_1(Qu, x) = A_1(Qw, x)$ $(= A_1^*(x))$ for all $x \in X$, and so $u_2$ and $w_2$ are both fixed points of the contraction $z \mapsto \max_{a \in A_1^*(x)} (Q\tilde{z})(x,a)_2$ and are therefore equal. Continuing in this way for the higher indices we get the proof of the required statement. Q.e.d.

The above theorem shows that the dilemma of Buridan’s ass is indeed in the realm of Theorem 2.1, since the appropriate reinforcement propagator operator, $Q$, acts componentwise.

202 Gábor, Kalmár, and Szepesvári

Theorem 2.1 is just one example of how the existence of optimal stationary policies can be ensured in multi-criteria problems. There are many possible extensions of it, but these are outside the scope of the present article.
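
To make the iteration $v \mapsto Tv$ concrete, the following Python sketch runs it on a small problem with a lexicographically ordered return space. It is an illustration under simplifying assumptions that are not part of the theorem: transitions are deterministic, every component is a plain discounted sum (so $Q$ acts componentwise), and the names `lexicographic_value_iteration`, `step` and `reward` are hypothetical. The m.a.x. over actions is taken with Python's built-in tuple comparison, which is lexicographic.

```python
def lexicographic_value_iteration(states, actions, step, reward, gamma, sweeps=200):
    """Iterate v <- Tv for vector-valued returns under the lexicographic order.

    step(x, a) gives the next state and reward(x, a) a tuple of immediate
    rewards, one entry per criterion. Returns the final value function and
    the greedy (myopic) action chosen in each state during the last sweep."""
    n = len(reward(states[0], actions[0]))
    v = {x: (0.0,) * n for x in states}
    greedy = {}
    for _ in range(sweeps):
        new_v = {}
        for x in states:
            q = {a: tuple(r + gamma * w
                          for r, w in zip(reward(x, a), v[step(x, a)]))
                 for a in actions}
            # Tuples compare lexicographically; ties rely on exact float equality.
            greedy[x] = max(q, key=q.get)
            new_v[x] = q[greedy[x]]
        v = new_v
    return v, greedy
```

In a one-state example with rewards $(1, 0)$, $(1, 1)$ and $(0, 5)$ for three actions and $\gamma = 0.5$, the first two actions tie on the primary criterion and the second is selected by the subordinate one, so the values approach $(2, 2)$. The reliance on exact equality of floating-point values for ties is fragile in general; a tolerance would be needed in less regular problems.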

3  Learning  optimal  policies 

Since most convergence proofs for RL algorithms rely on contraction arguments, generalizations of the convergence results for algorithms such as Adaptive Real-Time Dynamic Programming [Barto et al., 1991], Q-learning [Watkins, 1990], and TD($\lambda$) [Sutton, 1988] are easy to obtain for vector-valued MDPs provided that $T$ is a contraction.^8 Unfortunately, this will rarely hold. Nevertheless a componentwise analysis, similar to the one presented at the end of the previous section, will in general yield the desired convergence result.

As a particular example consider the case of Q-learning.
Let $Q^* = Q_{v^*}$ be the optimal action-value
function. Q-learning solves the fixed point equation
$Q^* = \mathcal{Q} S Q^*$, $(SQ)(x) = \mathrm{m.a.x.}_{b \in A(x)} Q(x, b)$,
by relaxation and without ever estimating $\mathcal{Q}$. In the case
of an MDP with the expected discounted total cost
Q-learning takes the form

$$Q_{t+1}(x_t, a_t) = (1 - \alpha_t(x_t, a_t))\, Q_t(x_t, a_t) + \alpha_t(x_t, a_t) \{ R_t(x_t, a_t, x_{t+1}) + \gamma \max_b Q_t(x_{t+1}, b) \},$$

with $Q_{t+1}(x, a) = Q_t(x, a)$ for pairs $(x, a) \ne (x_t, a_t)$.
The relaxation factor (learning rate) $0 \le \alpha_t(x_t, a_t) \le 1$
is gradually decreased towards zero so that the variance
of the estimates is reduced and convergence (with
probability one) can be achieved.
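
The scalar update above can be illustrated with a minimal tabular sketch in Python. The function name, the dictionary-based table, and the toy states and actions are assumptions made for illustration only; they do not come from the paper.

```python
from collections import defaultdict


def q_learning_step(Q, x, a, reward, x_next, actions, alpha, gamma):
    """Apply one tabular Q-learning update to the visited pair (x, a).

    Q maps (state, action) pairs to estimates. Entries for all other
    pairs are left unchanged, as in the update rule in the text.
    """
    # Bootstrapped target: immediate reward plus discounted best next value.
    target = reward + gamma * max(Q[(x_next, b)] for b in actions)
    # Relaxation step with learning rate alpha.
    Q[(x, a)] = (1.0 - alpha) * Q[(x, a)] + alpha * target
    return Q[(x, a)]


# Hypothetical toy data: two actions and one observed transition s0 -> s1.
Q = defaultdict(float)
Q[("s1", "right")] = 2.0
updated = q_learning_step(Q, "s0", "left", reward=1.0, x_next="s1",
                          actions=["left", "right"], alpha=0.5, gamma=0.9)
print(updated)
```

Only the entry for the visited pair changes; a decaying schedule for `alpha` would be supplied by the caller.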

^8 In fact, since the convergence of the vast majority
of RL algorithms follows from the general asynchronous
contraction-mapping theorem of [Littman and Szepesvári,
1996] (see also [Szepesvári and Littman, 1997]), it is
sufficient to reproduce the proof of that theorem. It turns out
that the raw generalization of that proof will work without
any problems for contractions. However, this is outside
the scope of this article.

A raw generalization of Q-learning to vector-valued
Q-learning would replace the immediate-reward scalars
($R_t$) in the above equation by immediate-reward vectors
and "max" by "m.a.x." (remember that m.a.x.
is the maximum element of $A$ according to the chosen
ordering $\le$ of $\mathcal{R}$; see Eq. (3) for the definition of
m.a.x.). For simplicity, consider a two-dimensional return
space with the lexicographic ordering and a componentwise
reinforcement propagation scenario in which
the components are computed by some expected value
criteria. Proceeding componentwise, we see that the
update equation for the first component is left intact,
but the update of the second component becomes

$$Q_{t+1,2}(x_t, a_t) = (1 - \alpha_t(x_t, a_t))\, Q_{t,2}(x_t, a_t) + \alpha_t(x_t, a_t) \{ R_{t,2}(x_t, a_t, x_{t+1}) + \gamma \max_{b \in A_1(Q_t, x_t)} Q_{t,2}(x_{t+1}, b) \},$$

where $A_1(Q, x)$ is defined by Eq. (6). Note that in the
example of Buridan's ass the first criterion is a truncated
evaluation criterion and thus must be treated
differently. Unfortunately, due to the lack of space
we cannot present the direct learning rules for this
criterion, but we note here that this rule can be obtained
almost entirely automatically if one tries to
estimate $Z^*(x, a) = \sum_y p(x, a, y)\, Q_1^*(x, a)$ instead of
$Q_1^*$ [Szepesvári and Littman, 1997]. Note that knowing
$Z^*$ alone is insufficient to recover $Q_1^*$. Therefore
either one estimates $R(x, a)$ and then computes $Q_1^*(x, a)$,
or one may estimate $Q_1^*$ in a second update rule using
the estimates of $Z^*$ directly, without ever estimating
$R(x, a)$. This latter rule will work only if $R(x, a)$
is deterministic. The convergence of these algorithms
follows by the standard proofs completed with a
componentwise analysis.
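
The componentwise update for a two-dimensional lexicographic return can be sketched as follows. This is an illustrative Python sketch under two stated assumptions: Eq. (6) is not reproduced in this section, so $A_1$ is read here as the set of actions that are greedy with respect to the first component, and that set is evaluated at the successor state. All function and variable names are hypothetical.

```python
from collections import defaultdict


def greedy_first_component(Q1, x, actions):
    """Assumed reading of A_1: actions maximizing the first component at x."""
    best = max(Q1[(x, b)] for b in actions)
    # Exact equality is used for ties; a tolerance may be preferable in practice.
    return [b for b in actions if Q1[(x, b)] == best]


def lexicographic_q_step(Q1, Q2, x, a, reward, x_next, actions, alpha, gamma):
    """Update both components of a two-dimensional Q estimate at (x, a)."""
    r1, r2 = reward
    # First component: the ordinary Q-learning target.
    target1 = r1 + gamma * max(Q1[(x_next, b)] for b in actions)
    # Second component: the max is restricted to the admissible set,
    # computed from the current first-component estimates.
    admissible = greedy_first_component(Q1, x_next, actions)
    target2 = r2 + gamma * max(Q2[(x_next, b)] for b in admissible)
    Q1[(x, a)] = (1.0 - alpha) * Q1[(x, a)] + alpha * target1
    Q2[(x, a)] = (1.0 - alpha) * Q2[(x, a)] + alpha * target2


# Hypothetical values: "up" is best for criterion 1 but worse for criterion 2.
Q1, Q2 = defaultdict(float), defaultdict(float)
Q1[("s1", "up")], Q1[("s1", "down")] = 1.0, 0.0
Q2[("s1", "up")], Q2[("s1", "down")] = -3.0, 5.0
lexicographic_q_step(Q1, Q2, "s0", "up", (0.0, -1.0), "s1",
                     ["up", "down"], alpha=1.0, gamma=1.0)
print(Q1[("s0", "up")], Q2[("s0", "up")])
```

In this toy case the second component ignores the tempting value 5.0 of "down", because "down" is not admissible according to the first criterion.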

The analogue of Q-learning for MDPs with the maximin
criterion, proposed by Heger [Heger, 1994, 1996],
is the Q-hat algorithm given by

$$Q_{t+1}(x_t, a_t) = \min\{ Q_t(x_t, a_t),\ R_t(x_t, a_t, x_{t+1}) + \gamma \max_{b \in A} Q_t(x_{t+1}, b) \}.$$

This algorithm will converge to the optimal Q-function
if $Q_0 \ge Q^*$ (the initial estimate is optimistic). The raw
generalization replaces "min" and "max" by "m.i.n."
and "m.a.x.", respectively. Unfortunately, this generalization
may fail to converge to $Q^*$ since the convergence
of Q-hat exploits $Q_t \ge Q^*$ ($t \ge 0$) and this may
become invalid in this case.^9 In order to surmount this
problem one has to update the components with index two
and higher by some means other than Q-hat learning.
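
A minimal Python sketch of the scalar Q-hat update may help to see why the optimistic start matters: the estimate can only decrease, so it has to begin at or above the optimum. The names and the toy numbers below are illustrative assumptions, not taken from the paper.

```python
def q_hat_step(Q, x, a, reward, x_next, actions, gamma):
    """One Q-hat update: keep the worst target observed so far for (x, a)."""
    target = reward + gamma * max(Q[(x_next, b)] for b in actions)
    # The estimate is never increased, hence the optimistic initialization.
    Q[(x, a)] = min(Q[(x, a)], target)
    return Q[(x, a)]


# Hypothetical optimistic start: every entry is set to 10.0.
states, actions = ["s0", "s1"], ["a", "b"]
Q = {(s, b): 10.0 for s in states for b in actions}

first = q_hat_step(Q, "s0", "a", reward=1.0, x_next="s1",
                   actions=actions, gamma=0.5)
# A later, more favourable outcome of the same pair does not raise the estimate.
second = q_hat_step(Q, "s0", "a", reward=3.0, x_next="s1",
                    actions=actions, gamma=0.5)
print(first, second)
```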

^9 This can be shown in the following way: Consider
again $\mathcal{R} = \mathbb{R}^2$ with the lexicographic ordering.
Then $Q_{t+1,2}(x_t, a_t) = \min\{ Q_{t,2}(x_t, a_t),\ R_{t,2}(x_t, a_t, x_{t+1}) + \gamma \max_{b \in A_1} Q_{t,2}(x_{t+1}, b) \}$,
where $A_1 = A_1(Q_t, x_t)$. Notice
that $Q_{t+1,2}(x, a) \le Q_{t,2}(x, a)$ for all $(x, a) \in U$, so if once
$Q_{t,2}(x, a) < Q_2^*(x, a)$ then $Q_{t,2}(x, a)$ cannot converge to
$Q_2^*(x, a)$. Here, $A_1(Q_t, x_t)$ may be quite different from
$A_1(Q^*, x_t)$, which means that $Q_{t+1,2}(x_t, a_t)$ may become
smaller than $Q_2^*(x_t, a_t)$ even if $Q_{t,2} = Q_2^*$, depending only
on the values of $Q_{t,1}$.

It is natural then to consider adaptive real-time dynamic
programming algorithms. For maximin problems
this algorithm builds an estimate of the transition
sets $T(x, a) = \{ y \in X \mid p(x, a, y) > 0 \}$ using

$$T_{t+1}(x_t, a_t) = T_t(x_t, a_t) \cup \{ x_{t+1} \}$$

and another estimate of the rewards $R(x, a, y)$ by
$R_{t+1}(x_t, a_t, x_{t+1}) = R_t$, where $x_t, a_t$ are the state and
action at time $t$, and where $R_t \in \mathcal{R}$ is the immediate
reward vector at time $t$. The value function estimate
$V_t(x) \in \mathcal{R}$ is updated by the equation

$$V_{t+1}(x_t) = \mathrm{m.a.x.}_{a}\ \mathrm{m.i.n.}_{y \in T_{t+1}(x_t, a)} \left( R_t(x_t, a, y) + \gamma V_t(y) \right).$$

Since there is no "optimistic initialization" condition
here, one may show (using a componentwise analysis)
that this algorithm converges to optimality if some
other conditions ensuring "sufficient exploration" hold.
Further discussions related to action-selection strategies
ensuring "sufficient exploration" in minimax problems
can be found in [Gabor et al., 1998].
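
The maximin backup can be sketched in Python by representing vector values as tuples, since Python compares tuples lexicographically. This is only a sketch under stated assumptions: the helper names are hypothetical, the reward values are made up, and actions that have never been tried are simply skipped, which the paper does not prescribe.

```python
def record_transition(T, R, x, a, reward, y):
    """Grow the estimated transition set T(x, a) and store the observed reward."""
    T.setdefault((x, a), set()).add(y)
    R[(x, a, y)] = reward


def maximin_backup(V, T, R, x, actions, gamma, default=(0.0, 0.0)):
    """V(x) <- m.a.x. over actions of m.i.n. over known successors.

    Values are tuples; tuple comparison in Python is lexicographic,
    which stands in for the lexicographic ordering of the return space.
    """
    candidates = []
    for a in actions:
        successors = T.get((x, a), set())
        if not successors:
            continue  # assumption: untried actions are skipped
        worst = min(
            tuple(r_i + gamma * v_i
                  for r_i, v_i in zip(R[(x, a, y)], V.get(y, default)))
            for y in successors
        )
        candidates.append(worst)
    if candidates:
        V[x] = max(candidates)
    return V.get(x, default)


# Hypothetical experience: action "a" may lose, action "b" always draws.
T, R, V = {}, {}, {}
record_transition(T, R, "s", "a", (1.0, -1.0), "y1")
record_transition(T, R, "s", "a", (-1.0, -1.0), "y2")
record_transition(T, R, "s", "b", (0.0, -1.0), "y3")
record_transition(T, R, "s", "b", (0.0, -2.0), "y4")
value = maximin_backup(V, T, R, "s", ["a", "b"], gamma=1.0)
print(value)
```

The worst case of "a" is a loss, so the backup prefers "b", whose worst case is a draw with the less favourable second component.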

4  Computer  simulations 

The purpose of the computer simulations was twofold:
to demonstrate that the theory works in practice, and
to provide some hints on the rate of convergence of
different algorithms. The ARTDP algorithm was tried
out for tic-tac-toe with lexicographic ordering. The
first criterion prescribed the desire to win (or make
a draw) and the second to finish the game as soon
as possible.^10 The action selection procedure was the
greedy policy in all of the cases, i.e., $Q_t(x, a(t)) =
\mathrm{m.a.x.}_a\, Q_t(x, a)$ for each $t$. Several opponents were tried
whose strategy was a mixture of the optimal minimax
policy (computed by α-β pruning with ties broken
randomly) and a totally randomized one. The degree
of randomness was set to 0, 0.25, 0.5, 0.75 and
1, so that the first opponent, corresponding to
randomness 0, is the optimal one, while the last one is
the totally randomized one. For comparison both the
multi-criteria and single-criterion ARTDP algorithms
were tried (called MC-ARTDP and ARTDP, respectively).
The learner started the game in each trial.
The percent of wins and draws, and the number of
steps in the cases of won or drawn games are shown in

^10 The first component of the reinforcement vector was
+1 if the learner won, 0 if the game was a draw and -1 if
he lost the game. The second component was unity in each
step. We used the well known minimax representation of
alternating games [see e.g. Littman and Szepesvári, 1996].
Note that by a simple change to the lexicographic ordering
one may consider another criterion when the learner
minimizes the number of steps only when starting from winning
states, otherwise trying to mark time.


Opponent randomness       0      0.25   0.5    0.75   1

ARTDP     Win or draw     0.73   0.74   0.74   0.76   0.74
          Steps           3.55   4.2    4.18   4.18   4.19

MC-ARTDP  Win or draw     0.85   1      0.96   1      1
          Steps           3.59   3.28   3.29   3.28   3.28

Table 1: Results of exhaustive testing. Percents of
optimal moves learnt, and average number of steps to
the end of the game for cases when the learner won
are shown for both learners learning with ARTDP and
MC-ARTDP. In the first row the degree of randomness
of the opponents is shown: a randomness of 0 means
an optimal opponent, while a randomness of 1 means a
perfectly random opponent. The results suggest that
since the learners do not explore, a complete optimal
policy cannot be learned against the perfect opponent
(just part of the game-tree is explored). The number of
steps until the end of the game is consistently smaller
for MC-ARTDP than for ARTDP. Also MC-ARTDP
can win a larger percent of games.


Table 1. The percents are computed by employing an
exhaustive search, i.e., the percent of those leaves in
the full reachable game-tree where our learner did not
lose the game was measured. As expected, the number
of steps until the end of the game is lower on average
for the MC-ARTDP algorithm than for the
ARTDP algorithm. Note that this comparison is not
entirely satisfactory since this number is computed for
just a part of the games and this obviously distorts the
results. The effect of this can be observed in the statistics
of the games played against the perfect opponent:
apparently here MC-ARTDP needed more steps than
ARTDP, but since MC-ARTDP won a larger percent
of games this increase can be attributed to the games
that MC-ARTDP won (or drew) and ARTDP lost.
Intriguingly, the results also show that MC-ARTDP
performs better than ARTDP in all of the cases, i.e., it
could explore a larger part of the game-tree. We
conjectured that the reason for this is that MC-ARTDP
uses more information than ARTDP. In particular,
since the second components of its evaluation function
are initialized to zero, initially unexplored actions will
look more favourable than explored ones, meaning that
dependence on the second component will facilitate
exploration. To confirm the conjecture we ran another
set of experiments using the ARTDP algorithm with
actions chosen according to one of the following
two well-known exploration strategies: Boltzmann
exploration and the ε-greedy strategy with decaying
exploration.^11 In this case ARTDP yielded results
comparable to those of MC-ARTDP, thus confirming the
hypothesis.^12

Figure 1: Results of learning with the one-criterion
and multi-criteria ARTDP algorithms against opponents
of different strengths. MC-ARTDP-0.25 and
MC-ARTDP-0.75 label the curves of MC-ARTDP for
an opponent with randomness 0.25 and 0.75,
respectively.
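
The ε-greedy strategy used in these control experiments picks the best-looking action with probability 1 - ε and otherwise a uniformly random action among the remaining ones. A minimal Python sketch is given below; the function name and the example values are hypothetical, and no decay schedule for ε is shown because the text does not specify one.

```python
import random


def epsilon_greedy(values, epsilon, rng=random):
    """Choose an action from a dict mapping action -> estimated value.

    Values may be scalars or tuples (tuples compare lexicographically).
    With probability epsilon a non-greedy action is drawn uniformly.
    """
    greedy = max(values, key=values.get)
    others = [a for a in values if a != greedy]
    if others and rng.random() < epsilon:
        return rng.choice(others)
    return greedy


# Hypothetical two-component estimates for three moves.
values = {"corner": (1.0, -3.0), "centre": (1.0, -2.0), "edge": (0.0, 0.0)}
always_greedy = epsilon_greedy(values, epsilon=0.0)
explored = [epsilon_greedy(values, epsilon=1.0) for _ in range(50)]
print(always_greedy)
```

With ε = 0 the rule reduces to the greedy policy used in the main experiments.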

Exploration has a price, though. The more exploratory
actions the player tries, the larger is the number
of games lost during the learning trials. In order to
get a more complete picture of the performance
of the two algorithms we have measured on-line (or
during-learning) performance. Results are shown in
Figure 1. The upper subfigure shows the percent of
games won or drawn. The faster the convergence
to 1, the smaller the cost of exploration. The lower
subfigure depicts the number of steps until the end of
the game, for the games that our learner actually won.
Both subfigures show results for the opponents with
randomness 0.25 and 0.75 (results for the other cases can
be roughly obtained by interpolation and extrapolation and
are not shown). Note that both ARTDP and MC-ARTDP
learn faster against weaker opponents, which
could be attributed to the small average depth of the
visited game tree when playing against a weak opponent.
Note that a learner trained against a weak opponent
will probably fail to win over a strong one, and
the reverse may hold, too; in order to learn the
optimal minimax strategy the opponents should not be
restricted.^13 Also, in the case of both opponents
MC-ARTDP learns slightly slower (in the short term) but
results in a better policy in the medium term. More
experiments are needed to analyze these findings.

^11 The ε-greedy exploration strategy chooses the
best-looking (greedy) action with probability 1 - ε and chooses
an action uniformly randomly from the rest with
probability ε [Thrun, 1992].

^12 In theory, as time goes to infinity both algorithms will
converge to optimality. So the worse than optimal results
should not be considered as cases when the algorithms got
stuck in "local minima".

5  Conclusions 

We have considered multi-criteria decision problems
in the framework of abstract dynamic programming.
The reinforcements were assumed to be vector-valued
and were compared by a total ordering defined over an
appropriate vector space. A result showing the
existence of optimal policies was derived, and it was shown
that it applies to lexicographic ordering when the
reinforcement propagation works "componentwise". Next,
reinforcement learning algorithms were derived for this
case and we have argued that their convergence can be
proven by componentwise analysis. Experimental
results were presented to illustrate the behavior of the
algorithms. In the future we plan to extend the
results and run other simulations to reinforce the utility
of multi-criteria learning.
Appendix

Here we prove Theorem 2.1, the text of which is not
repeated here because of lack of space. Firstly, we
shall prove that $v^+$, the unique fixed point of $T$,
majorizes the optimal value function, $v^*$. Fix an
arbitrary policy $\pi$ and observe that $T v_\pi \ge T_\pi v_\pi$. Since
$T_\pi v_\pi = v_\pi$, also $T v_\pi \ge v_\pi$. From this, and because
of the monotonicity of $T$ (which holds because $A$ is
finite), we obtain $T^2 v_\pi \ge T v_\pi \ge v_\pi$. Iterating this
indefinitely, we get that $T^{n+1} v_\pi \ge T^n v_\pi \ge \cdots \ge v_\pi$
holds for all $n \in \mathbb{N}$. Thus, $T^n v_\pi$ is monotone increasing
and thus (by the countable transitivity assumption)
$\lim_{n \to \infty} T^n v_\pi \ge v_\pi$. Now, since $\lim_{n \to \infty} T^n v_\pi = v^+$,
we have $v^+ \ge v_\pi$. Since $\pi$ was arbitrary, it follows that
$v^+ \ge v^*$ by the definition of the s.u.p. operator.
Now, let $\pi$ be a policy which is myopic w.r.t. $v^+$:
$T_\pi v^+ = T v^+$. Since $T v^+ = v^+$, also $T_\pi v^+ = v^+$. Now,
since $v_\pi$ is the unique fixed point of $T_\pi$ ($T_\pi$ is a
contraction since $\mathcal{Q}$ is a contraction), we get that $v^+ = v_\pi$.
This shows that $v^+ = v^*$ and that $\pi$ is optimal. In
order to prove the third part consider a pair of policies
$(\pi, \pi')$ s.t. $T_{\pi'} v_\pi \ge v_\pi$. By the first train of thought,
we get that $T_{\pi'}^n v_\pi \ge v_\pi$ is a monotone increasing
sequence, so that $v_{\pi'} = \lim_{n \to \infty} T_{\pi'}^n v_\pi \ge v_\pi$ holds, too,
thus finishing the proof.

^13 Since the opponents are randomized (except the
optimal opponent) the algorithms will eventually converge
to optimality. However, the convergence rate will still
depend on the degree of randomness of the opponent. The
convergence rate will depend on how fast the part of
the game-tree which is accessible for an optimal player can be
fully explored. For opponents with higher randomness deep
parts can hardly be accessed, while for opponents with small
randomness parts that follow an initial sub-optimal choice will
be hard to explore.
Abstract


In previous work we presented Cascade
Generalization, a new general method
for merging classifiers. The basic idea of Cascade
Generalization is to sequentially run the
set of classifiers, at each step performing an
extension of the original data by the insertion
of new attributes. The new attributes
are derived from the probability class distribution
given by a base classifier. This constructive
step extends the representational
language for the high level classifiers, relaxing
their bias. In this paper we extend this
work by applying Cascade locally. At each
iteration of a divide and conquer algorithm,
a reconstruction of the instance space occurs
by the addition of new attributes. Each new
attribute represents the probability that an
example belongs to a class given by a base
classifier. We have implemented three Local
Generalization Algorithms. The first merges
a linear discriminant with a decision tree, the
second merges a naive Bayes with a decision
tree, and the third merges a linear discriminant
and a naive Bayes with a decision
tree. All the algorithms show an increase of
performance when compared with the
corresponding single models. Cascade also
outperforms other methods for combining
classifiers, like Stacked Generalization, and
competes well against Boosting, with statistically
significant confidence levels.


Keywords: Multiple Models, Constructive Induction,
Merging Classifiers.


1  Introduction 

The ability of a chosen algorithm to induce a good
generalization depends on how appropriate the class
model underlying the algorithm is for the given task.
An algorithm's class model is the representation
language it uses to express a generalization of the
examples. The representation language for a standard
decision tree is the DNF formalism that splits the
instance space by axis-parallel hyper-planes, while the
representation language for a linear discriminant
function is a set of linear functions that split the instance
space by oblique hyper-planes. Since different learning
algorithms employ different knowledge representations
and search heuristics, different search spaces are
explored and diverse results are obtained. The problem
of finding the appropriate bias for a given task
is an active research area. We can consider two main
lines: on one hand, methods that try to select the most
appropriate algorithm for the given task, for instance
Schaffer's selection by Cross-Validation, and on the
other hand, methods that combine predictions of
different algorithms, for instance Stacked Generalization
[25]. This work follows the second research line.
Instead of looking for methods that fit the data using a
single representation language, we present a family of
algorithms, under the generic name of Cascade
Generalization, whose search space contains models that
use different representation languages. Cascade
generalization was first presented in [14]. It performs an
iterative composition of classifiers. At each iteration
a classifier is generated. The input space is extended
by the addition of new attributes. These take the
form of a probability class distribution, which is
obtained for each example from the generated base
classifier. The language of the final classifier is the language
used by the high level generalizer. This language uses
terms that are expressions from the language of low
level classifiers. In this sense, Cascade Generalization
generates a unified theory from the base theories.

Here we extend the work presented in [14] by applying
Cascade locally. In our implementation, Local Cascade
Generalization generates a decision tree. The
experimental study shows that this methodology usually
improves both accuracy and theory size at statistically
significant levels.

The  next  section  presents  the  framework  of  Cascade 
Generalization.  In  section  3  we  define  a  new  family  of 
algorithms  that  apply  Cascade  Generalization  locally. 
In  section  4  we  review  previous  work  in  the  area  of 
multiple  models.  In  section  5,  we  perform  an  empirical 
study  using  UCI  data  sets.  The  last  section  presents 
an  analysis  of  the  results  and  concludes  the  paper. 

2  Cascade  Generalization 

Consider a learning set $D = (\vec{x}_n, y_n)$, $n = 1, \dots, N$,
where $\vec{x}_n = [x_1, \dots, x_m]$ is a multidimensional input
vector, and $y_n$ is the output variable. Since the
focus of this paper is on classification problems, $y_n$
takes values from a set of predefined values, that is
$y_n \in \{Cl_1, \dots, Cl_c\}$, where $c$ is the number of classes.
A classifier $\Im$ is a function that is applied to the
training set $D$ in order to construct a model $\Im(D)$. The
generated model is a mapping from the input space
$X$ to the discrete output variable $Y$. When used as a
predictor, represented by $\Im(\vec{x}, D)$, it assigns a $y$ value
to the example $\vec{x}$. This is the traditional framework
for classification tasks. Our framework requires that
the predictor $\Im(\vec{x}, D)$ outputs a vector representing a
conditional probability distribution $[p_1, \dots, p_c]$, where
$p_i$ represents the probability that the example $\vec{x}$
belongs to class $i$, i.e. $P(y = Cl_i \mid \vec{x})$. The class that is
assigned to the example $\vec{x}$ is the one that maximizes
this last expression. Most of the commonly used
classifiers, such as naive Bayes and Discriminant, classify
each example in this way. Other classifiers, for
example C4.5, have a different strategy for classifying
an example, but only small changes are required to obtain a
probability class distribution.

We define a constructive operator $\Phi(D', \Im(\vec{x}, D))$.
This operator has two input parameters: a data set
$D'$ and a predictor $\Im(\vec{x}, D)$. The classifier $\Im$ generates
a theory from the training data $D$. For each example
$\vec{x} \in D'$, the generated theory outputs a probability
class distribution. For all the examples in $D'$ the
operator $\Phi$ concatenates the input vector $\vec{x}$ with the
output probability class distribution. The output of
$\Phi(D', \Im(\vec{x}, D))$ is a new data set $D''$. The cardinality
of $D''$ is equal to the cardinality of $D'$ (i.e. they
have the same number of examples). Each example
$\vec{x} \in D''$ has an equivalent example in $D'$, but
augmented with $c$ new attributes. The new attributes are
the elements of the vector of class probability distribution
obtained when applying classifier $\Im(\vec{x}, D)$ to
the example $\vec{x}$. Cascade generalization is a sequential
composition of classifiers that at each generalization
level applies the $\Phi$ operator. Given a training set $L$,
a test set $T$, and two classifiers $\Im_1$ and $\Im_2$, Cascade
generalization proceeds as follows:

Using classifier $\Im_1$, generate the $Level_1$ data:

$Level_1 train = \Phi(L, \Im_1(\vec{x}, L))$

$Level_1 test = \Phi(T, \Im_1(\vec{x}, L))$

Classifier $\Im_2$ learns on the $Level_1$ training data and
classifies the $Level_1$ test data:

$\Im_2(\vec{x}, Level_1 train)$ for each $\vec{x} \in Level_1 test$

These steps perform the basic sequence of a cascade
generalization of classifier $\Im_2$ after classifier $\Im_1$. We
represent the basic sequence by the symbol $\nabla$.

The previous composition can be written compactly as:

$\Im_2 \nabla \Im_1 = \Im_2(\vec{x}, Level_1 train)$ for each $\vec{x} \in Level_1 test$

which is equivalent to:

$\Im_2 \nabla \Im_1 = \Im_2(\vec{x}, \Phi(L, \Im_1(\vec{x}', L)))$ for each
$\vec{x} \in \Phi(T, \Im_1(\vec{x}'', L))$

This is the simplest formulation of Cascade Generalization.
Some possible extensions include the composition
of n classifiers, and the parallel composition of
classifiers.
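
The basic sequence can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: `fit_centroid_classifier` is a hypothetical stand-in base classifier (a softmax over distances to class centroids), the same stand-in is used at both levels purely for brevity, and the data are made up.

```python
import math


def fit_centroid_classifier(data):
    """Hypothetical stand-in classifier; data is a list of (x, y) pairs.

    Returns (classes, predict_proba), where predict_proba(x) is a list of
    class probabilities from a softmax over negative squared distances.
    """
    sums, counts = {}, {}
    for x, y in data:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    classes = sorted(sums)
    centroids = [[s / counts[c] for s in sums[c]] for c in classes]

    def predict_proba(x):
        scores = [-sum((v - m) ** 2 for v, m in zip(x, cen)) for cen in centroids]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        return [w / total for w in weights]

    return classes, predict_proba


def phi(data, predict_proba):
    """Constructive operator: append the class distribution to each input."""
    return [(list(x) + predict_proba(x), y) for x, y in data]


def cascade(fit_high, fit_low, train, test):
    """Run fit_high after fit_low: build Level 1 data, then classify the test set."""
    _, low_proba = fit_low(train)
    level1_train = phi(train, low_proba)
    level1_test = phi(test, low_proba)
    classes, high_proba = fit_high(level1_train)
    predictions = []
    for x, _ in level1_test:
        p = high_proba(x)
        predictions.append(classes[p.index(max(p))])
    return level1_train, predictions


train = [([0.0, 0.0], "neg"), ([0.0, 1.0], "neg"),
         ([4.0, 4.0], "pos"), ([5.0, 4.0], "pos")]
test = [([0.5, 0.5], None), ([4.5, 4.0], None)]
level1_train, predictions = cascade(fit_centroid_classifier,
                                    fit_centroid_classifier, train, test)
print(len(level1_train[0][0]), predictions)
```

Each Level 1 example carries the two original attributes plus one probability per class, and the low-level model is always trained on the training set only, also when it extends the test set.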

A composition of n classifiers is represented by:
$\Im_n \nabla \Im_{n-1} \nabla \Im_{n-2} \cdots \nabla \Im_1$

In this case, Cascade Generalization generates n-1 levels
of data. The high level theory is the one given by
the $\Im_n$ classifier.

A variant of cascade generalization, which includes
several algorithms in parallel, could be represented in
this formalism by:

$\Im_{n+1} \nabla [\Im_1, \dots, \Im_n] =
\Im_{n+1}(\vec{x}, \Phi(L, [\Im_1(\vec{x}', L), \dots, \Im_n(\vec{x}', L)]))$
for each $\vec{x} \in \Phi(T, [\Im_1(\vec{x}'', L), \dots, \Im_n(\vec{x}'', L)])$


208  Gama 


The algorithms $\Im_1, \dots, \Im_n$ run in parallel. The
operator $\Phi(L, [\Im_1(\vec{x}', L), \dots, \Im_n(\vec{x}', L)])$ returns a new data
set $L'$ which contains the same number of examples as
$L$. Each example in $L'$ contains $n \times c$ new attributes,
where $c$ is the number of classes. Each algorithm in
the set $\Im_1, \dots, \Im_n$ contributes $c$ new attributes.
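
The attribute count of the parallel variant can be checked with a small Python sketch. The function `phi_parallel` and the two fixed predictors are hypothetical stand-ins for trained base classifiers, chosen only to make the bookkeeping visible.

```python
def phi_parallel(data, predictors):
    """Append the class distributions of all predictors to each input vector."""
    extended = []
    for x, y in data:
        new_x = list(x)
        for predict_proba in predictors:
            new_x.extend(predict_proba(x))
        extended.append((new_x, y))
    return extended


# Two hypothetical predictors for a problem with c = 3 classes.
def uniform(x):
    return [1 / 3, 1 / 3, 1 / 3]


def first_feature_rule(x):
    return [0.8, 0.1, 0.1] if x[0] > 0 else [0.1, 0.1, 0.8]


L = [([1.0, 2.0], "A"), ([-1.0, 0.5], "C")]
L_prime = phi_parallel(L, [uniform, first_feature_rule])
print(len(L_prime[0][0]))
```

With n = 2 predictors and c = 3 classes, every example gains n × c = 6 attributes on top of its 2 original ones, and the number of examples is unchanged.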

3  Local  Cascade  Generalization 

Most Machine Learning algorithms for supervised
learning use a divide and conquer strategy that
attacks a complex problem by dividing it into simpler
problems and recursively applies the same strategy
to the subproblems. Solutions of subproblems can
be combined to yield a solution of the complex problem.
This is the basic idea behind well known decision
tree based algorithms: ID3 (Quinlan, 1984), ASSISTANT
(Kononenko et al., 1987), CART (Breiman et
al., 1984), C4.5 (Quinlan, 1993), etc. The power of
this approach comes from the ability to split the
hyperspace into subspaces and fit each subspace with
different functions. In our previous work [14] we have shown
that Cascade significantly improves the performance
of this type of learning algorithm. In this paper we
explore the applicability of Cascade to the problems
and subproblems that a divide and conquer algorithm
must solve. The intuition behind this hypothesis is the
same as behind any divide and conquer strategy. The
relations that cannot be captured at the global level can
be discovered on the simpler subproblems.

Local cascade generalization is a composition of
algorithms that is performed for each task when building
the classifier. At each iteration of a divide and
conquer algorithm, local cascade generalization will be
performed by applying the $\Phi$ operator. The effect is
that the input space is reconstructed by the insertion
of the new attributes. These new attributes are
propagated down to the subtasks that the algorithm might
consider. In this paper we restrict the use of local
Cascade Generalization to decision tree based algorithms.
However, it would be possible to use it with any divide
and conquer algorithm. Figure 1 presents the general
algorithm of local Cascade Generalization, applied to
a decision tree.

When growing the tree, at each decision node new
attributes are computed by applying the $\Phi$ operator.
The new attributes that are created there are propagated
down the tree. The number of new attributes is
equal to the number of classes of the examples that fall
at this node. At different levels, the algorithm considers
data sets with different numbers of attributes and
classes. Deeper nodes contain an increasing number
of attributes. This could be a disadvantage of the system,
but the number of new attributes is not constant.
As the tree grows and the classes are discriminated,
deeper nodes also contain examples from a decreasing
number of classes. This means that as the tree grows
the number of new attributes decreases.

Input: A data set D, a base classifier $\Im$
Output: A decision tree
Function CGtree(D, $\Im$)
    IF stop_criteria(D) = TRUE
        return a Leaf with class probability distribution
    D' = $\Phi(D, \Im(\vec{x}, D))$
    Choose the attribute that maximizes the
        splitting criterion on D'
    For each partition of examples based on the
        chosen attribute values
        Tree_i = CGtree(D'_i, $\Im$)
    return Tree as a decision node based on the
        chosen attribute, storing $\Im(D)$
        and descendants Tree_i
End

Figure 1: Local Cascade Algorithm based on a Decision
Tree

In order to be applied as a predictor, any CGTree
must store, at each node, the model generated by the
base classifier using the examples that fall at this node.
When classifying a new example, the example traverses
the tree in the usual way, but at each decision
node it is extended by the insertion of the probability
class distribution provided by the base classifier predictor
at this node.
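
The growing and classification procedures can be sketched together in Python. This is a simplified illustration under several assumptions that are not from the paper: the base classifier is a hypothetical centroid-based stand-in, splits are binary thresholds chosen by weighted entropy, the stop criterion is purity or a minimum node size, and all names are invented.

```python
import math
from collections import Counter


def fit_base(data):
    """Hypothetical stand-in for the base classifier (softmax over centroid distances)."""
    groups = {}
    for x, y in data:
        groups.setdefault(y, []).append(x)
    centroids = [[sum(col) / len(rows) for col in zip(*rows)]
                 for _, rows in sorted(groups.items())]

    def predict_proba(x):
        scores = [-sum((v - m) ** 2 for v, m in zip(x, c)) for c in centroids]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        return [w / sum(weights) for w in weights]

    return predict_proba


def entropy(data):
    counts = Counter(y for _, y in data)
    return -sum(n / len(data) * math.log2(n / len(data)) for n in counts.values())


def best_split(data):
    """Return (attribute index, threshold) with the lowest weighted entropy, or None."""
    best, best_score = None, entropy(data)
    for j in range(len(data[0][0])):
        for t in sorted({x[j] for x, _ in data})[:-1]:
            left = [d for d in data if d[0][j] <= t]
            right = [d for d in data if d[0][j] > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best


def cg_tree(data, min_size=2):
    counts = Counter(y for _, y in data)
    leaf = {"leaf": {c: n / len(data) for c, n in counts.items()}}
    if len(counts) == 1 or len(data) < min_size:   # assumed stop criteria
        return leaf
    model = fit_base(data)                          # base model stored at this node
    extended = [(list(x) + model(x), y) for x, y in data]   # D' = Phi(D, model)
    split = best_split(extended)
    if split is None:
        return leaf
    j, t = split
    return {"model": model, "attr": j, "thr": t,
            "low": cg_tree([d for d in extended if d[0][j] <= t], min_size),
            "high": cg_tree([d for d in extended if d[0][j] > t], min_size)}


def classify(tree, x):
    x = list(x)
    while "leaf" not in tree:
        x = x + tree["model"](x)   # extend the example with this node's distribution
        tree = tree["low"] if x[tree["attr"]] <= tree["thr"] else tree["high"]
    return max(tree["leaf"], key=tree["leaf"].get)


# Made-up data with an oblique class boundary that no single axis-parallel split separates.
data = [([0, 0], "a"), ([3, 0], "a"), ([0, 3], "a"), ([1, 2], "a"),
        ([4, 0], "b"), ([0, 4], "b"), ([2, 2], "b"), ([3, 3], "b")]
tree = cg_tree(data)
print(tree["attr"], classify(tree, [1.0, 1.0]), classify(tree, [3.0, 2.5]))
```

In this toy case the root splits on one of the constructed probability attributes (index 2) rather than on an original attribute, which is the effect the constructive step is meant to have.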

In the framework of local cascade generalization,
we have developed CGLtree, which uses the
$\Phi(D, Discrim(\vec{x}, D))$ operator in the constructive
step. Each internal node of a CGLtree contains a
discriminant function. This discriminant function is used
to build new attributes. For each example $\vec{x}$, the value
of a new attribute $A_i$ is computed using the probability
$p(C_i \mid \vec{x})$, which is given by the linear discriminant
function. At each decision node, the number of new
attributes built by CGLtree is always equal to the
number of classes taken from the examples that fall
at this node. We use the following heuristic: we only
consider a $class_i$ if the number of examples, at this
node, belonging to $class_i$ is greater than $N$ times the
number of attributes.^1 By default $N$ is 3. This implies
that at different nodes, a different number of classes will
be considered and a different number of new attributes
is added.

^1 This heuristic was suggested by Breiman et al. [3].
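
The class-selection heuristic can be written down in a few lines of Python. The function name and the example counts are hypothetical; only the rule itself (strictly more than N times the number of attributes, with N = 3 by default) is taken from the text.

```python
from collections import Counter


def classes_to_consider(labels, n_attributes, factor=3):
    """Keep a class only if it has more than factor * n_attributes examples at the node."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n > factor * n_attributes)


# Hypothetical node: 4 attributes, so a class needs more than 12 examples.
labels = ["a"] * 13 + ["b"] * 12 + ["c"] * 2
kept = classes_to_consider(labels, n_attributes=4)
print(kept)
```

Class "b" is dropped here because its count equals the threshold and the rule asks for strictly more.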

In our empirical study we have used two other
algorithms that locally apply Cascade Generalization:
CGBtree, which uses as constructive operator
$\Phi(D, naiveBayes(\vec{x}, D))$, and CGBLtree, which uses as
constructive operator
$\Phi(D, [naiveBayes(\vec{x}, D), Discrim(\vec{x}, D)])$. In all
other aspects these algorithms are similar to CGLtree.

There is one restriction on the application of the
$\Phi(D', \Im(\vec{x}, D))$ operator: the $\Im$ classifier must return
a probability class distribution for each $\vec{x} \in D'$. Any
classifier that satisfies this requirement could be applied.
It is possible to imagine a CGTree whose internal
nodes are trees themselves. For example, small
modifications to C4.5^2 will allow the construction of a
CGTree whose internal nodes are trees generated by
C4.5.

4  Related  Work 

With respect to the final model, there are clear similarities
between CGLtree and Multivariate trees [5, 15].
Any multivariate tree is topologically equivalent to a
three-layer inference network [18]. The constructive
ability of our system is similar to the Cascade Correlation
Learning architecture [11]. Also the final model
of CGBtree is related to the recursive naive Bayes
presented in [17]. In a previous work [13], we have
compared the system Ltree, similar to CGLtree, with OC1
[19] and LMDT [5]. The focus of this paper is on
methodologies for combining classifiers. As such, we
review other methods that generate and combine multiple
models.

4.1  Combining  Classifications 

We can consider two main lines of research. One group includes methods where all base classifiers are consulted in order to classify a query example. The other includes methods that characterize the area of expertise of the base classifiers and, for a query point, only ask the opinion of the experts. Voting is the most common method used to combine classifiers. As pointed out by Ali and Pazzani [1], this strategy is motivated by Bayesian learning theory, which stipulates that in order to maximize the predictive accuracy, instead of using just a single learning model, one should ideally use all models in the hypothesis space. The vote of each hypothesis should be weighted by the posterior probability of that hypothesis given the training data. Several variants of the voting method can be found in the machine learning literature, from uniform voting, where the opinion of all base classifiers contributes to the final classification with the same strength, to weighted voting, where each base classifier has an associated weight, which could change over time, that strengthens the classification given by the classifier.

^Two different methods are presented in [14, 23].

Ortega [20] presents the “Model Applicability Induction” approach for combining predictions from multiple models. The approach consists of learning, for each available model, a referee that characterizes the situations in which that model is able to make correct predictions. For future instances these referees are first consulted to select the most appropriate prediction model, and the prediction of the selected model is then returned.
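The uniform and weighted voting schemes described in this section can be sketched as follows. This is a minimal illustration only; the function name `vote` and the example weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def vote(predictions: List[str], weights: Optional[List[float]] = None) -> str:
    """Return the class with the largest (weighted) number of votes."""
    if weights is None:
        weights = [1.0] * len(predictions)  # uniform voting
    totals: Dict[str, float] = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)


preds = ["a", "b", "b"]
uniform_winner = vote(preds)                            # two votes for "b"
weighted_winner = vote(preds, weights=[0.7, 0.2, 0.1])  # "a" has weight 0.7
```

With uniform weights the majority class wins, while a single heavily weighted classifier can overturn the majority under weighted voting.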

4.2  Generating  different  models 

Several methods for generating multiple models appear in the literature. Breiman [3] proposes bagging, which produces replications of the training set by sampling with replacement. Each replication of the training set has the same size as the original data, but some examples do not appear in it, while others may appear more than once. From each replication of the training set a classifier is generated. All classifiers are used to classify each example in the test set, usually using a uniform vote scheme.
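The replication step of bagging can be sketched in a few lines. The function name `bootstrap_replicate` and the fixed seed are assumptions made for this illustration.

```python
import random
from typing import List, TypeVar

T = TypeVar("T")


def bootstrap_replicate(training_set: List[T], rng: random.Random) -> List[T]:
    """Sample len(training_set) examples with replacement."""
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
training_set = list(range(10))
replicates = [bootstrap_replicate(training_set, rng) for _ in range(5)]
```

Each replicate has the size of the original training set; a classifier would then be trained on each replicate and the resulting classifiers combined by uniform voting.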

The boosting algorithm of Freund and Schapire [12] maintains a weight for each example in the training set that reflects its importance. Adjusting the weights causes the learner to focus on different examples, leading to different classifiers. Boosting is an iterative algorithm. At each iteration the weights are adjusted in order to reflect the performance of the corresponding classifier. The weight of the misclassified examples is increased. The final classifier aggregates the classifiers learned at each iteration by weighted voting. The weight of each classifier is a function of its accuracy.
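The text does not give the update formulas, so the following sketch assumes the AdaBoost.M1 variant: the weights of correctly classified examples are multiplied by beta = error / (1 - error) and then renormalized, and the classifier receives the voting weight log(1 / beta). The function name `boosting_round` is hypothetical.

```python
import math
from typing import List, Tuple


def boosting_round(weights: List[float],
                   correct: List[bool]) -> Tuple[List[float], float]:
    """One AdaBoost.M1-style reweighting step (an assumed variant).

    Returns the new normalized example weights and the voting weight
    of the classifier trained in this round.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("AdaBoost.M1 assumes a weighted error in (0, 0.5)")
    beta = error / (1.0 - error)
    # Shrink the weight of correctly classified examples...
    new_weights = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(new_weights)
    # ...so that, after normalization, misclassified examples weigh more.
    new_weights = [w / total for w in new_weights]
    return new_weights, math.log(1.0 / beta)


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]  # the fourth example is misclassified
new_weights, classifier_weight = boosting_round(weights, correct)
```

In this example the misclassified example ends up carrying half of the total weight, so the next classifier is pushed to focus on it.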

Wolpert [25] proposed Stacked Generalization, a technique that uses learning in two levels. A learning algorithm is used to determine how the outputs of the base classifiers should be combined. The original data set constitutes the level zero data. All the base classifiers run at this level. The level one data are the outputs of the base classifiers. Another learning process occurs using the level one data as input and the final classification as output. This is a more sophisticated version of cross validation that could reduce the error due to the bias.
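The construction of the level one data can be sketched as below. The sketch assumes that the level one attributes are out-of-fold class distributions produced by cross validation; the two toy learners and all names (`level_one_data`, `prior_learner`, `threshold_learner`) are hypothetical stand-ins for real base classifiers.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]            # (attributes, class label)
Predictor = Callable[[Sequence[float]], List[float]]
Learner = Callable[[List[Example]], Predictor]   # training set -> predictor


def level_one_data(data: List[Example], learners: List[Learner],
                   folds: int) -> List[Tuple[List[float], int]]:
    """Build level one examples from out-of-fold level zero predictions."""
    level1 = []
    for k in range(folds):
        held_out = data[k::folds]
        train = [ex for i, ex in enumerate(data) if i % folds != k]
        predictors = [learn(train) for learn in learners]
        for x, y in held_out:
            # Concatenate the outputs of all base classifiers.
            features = [p for predict in predictors for p in predict(x)]
            level1.append((features, y))
    return level1


def prior_learner(train: List[Example]) -> Predictor:
    """Toy learner: always predicts the class frequencies of the training set."""
    pos = sum(y for _, y in train) / len(train)
    return lambda x: [1.0 - pos, pos]


def threshold_learner(train: List[Example]) -> Predictor:
    """Toy learner: thresholds the first attribute at its training mean."""
    mean = sum(x[0] for x, _ in train) / len(train)
    return lambda x: [0.0, 1.0] if x[0] > mean else [1.0, 0.0]


data = [([0.1], 0), ([0.2], 0), ([0.8], 1), ([0.9], 1), ([0.3], 0), ([0.7], 1)]
level1 = level_one_data(data, [prior_learner, threshold_learner], folds=3)
```

A level one learner would then be trained on `level1`, whose attributes are the concatenated outputs of the base classifiers.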

Brodley [4] presents MCS, a hybrid algorithm that combines, in a single tree, nodes that are univariate tests, multivariate tests generated by linear machines, and instance-based learners. At each node MCS uses a set of If-Then rules to perform a hill-climbing search for the best hypothesis space and search bias for the given partition of the dataset. The set of rules incorporates knowledge of experts. MCS uses a dynamic search control strategy to perform an automatic model selection. MCS builds trees that can apply a different model in different regions of the instance space.

Chan and Stolfo [7] present two schemes for classifier combination: arbiter and combiner. Both schemes are based on meta-learning, where a meta-classifier is generated from meta-level data built from the predictions of the base classifiers. An arbiter is also a classifier and is used to arbitrate among predictions generated by different base classifiers. The training set for the arbiter is selected from all the available data, using a selection rule. An example of a selection rule is “Select the examples whose classification the base classifiers cannot predict consistently”. This arbiter, together with an arbitration rule, decides a final classification based on the base predictions. An example of an arbitration rule is “Use the prediction of the arbiter when the base classifiers cannot obtain a majority”. Later [8], they extended this framework by using arbiters/combiners in a hierarchical fashion, generating arbiter/combiner binary trees.
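The two example rules quoted above can be written down directly. This is only a sketch: it reads “cannot predict consistently” as “the base classifiers disagree”, which is an interpretation, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def needs_arbiter(base_predictions: List[str]) -> bool:
    """Selection rule: keep examples on which the base classifiers disagree."""
    return len(set(base_predictions)) > 1


def arbitrate(base_predictions: List[str], arbiter_prediction: str) -> str:
    """Arbitration rule: use the arbiter when there is no strict majority."""
    label, count = Counter(base_predictions).most_common(1)[0]
    if count > len(base_predictions) / 2:
        return label
    return arbiter_prediction
```

For instance, `arbitrate(["a", "b"], "c")` falls back on the arbiter because neither base prediction holds a strict majority.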

4.3  Discussion 

Earlier results of boosting or bagging are quite impressive. Using 10 iterations (i.e. generating 10 classifiers), Quinlan [22] reports reductions of the error rate between 10% and 19%. Quinlan argues that these techniques are mainly applicable to unstable classifiers. Both techniques require an unstable learning system, so that small changes in the training set produce different classifiers. Under an analysis of the bias-variance decomposition of the error [16], the reduction of the error observed with boosting or bagging is mainly due to the reduction in the variance. As mentioned by Ali and Pazzani [1], “the number of training examples needed by Boosting increases as a function of the accuracy of the learned model. Boosting could not be used to learn many models on the modest training set sizes used in this paper.”


Wolpert [25] says that successful implementation of Stacked Generalization for classification tasks is a “black art”, and the conditions under which stacking works are still unknown. Recently, Ting and Witten [23] have shown that successful stacked generalization requires the use of output class distributions rather than class predictions. In their experiments, only the MLR algorithm (a linear discriminant) was suitable as level-1 generalizer. Cascade Generalization belongs to the family of stacking algorithms. In the experiments described in [14] we have used the bias-variance analysis as a criterion to select algorithms. The experiments suggest that at the top level an algorithm with low bias, like a decision tree, should be used.

The main achievement of our proposed method is its ability to merge different models. As such, we get a single model whose components are terms of the base model language. The bias restriction imposed by using a single model is relaxed. Cascade gives a single and structured model for the data, and this is a strong advantage over the methods that combine classifiers by voting. Another advantage of Cascade Generalization is related to the use of probability class distributions. The usual learning algorithms produced by the Machine Learning community use categories when classifying examples. Combining classifiers by means of categorical classes loses the strength of the classifier in its prediction. The use of probability class distributions allows us to exploit that information.

5  Empirical  Evaluation 

5.1  The  Algorithms 

Ali and Pazzani [1] and Tumer and Ghosh [24] present empirical and analytical results that show that “the combined error rate depends on the error rate of individual classifiers and the correlation among them”. They suggest the use of “radically different types of classifiers” to reduce the correlation of errors. This was our criterion when selecting the algorithms for the experimental work. We use three classifiers that have different behaviors under a bias-variance analysis: a naive Bayes, a Linear Discriminant, and a Decision Tree.

5.1.1  Naive  Bayes 

Bayes' theorem makes it possible to optimally predict the class of an unseen example, given a training set. The chosen class is the one that maximizes $p(C_i \mid E) = p(C_i)\,p(E \mid C_i)/p(E)$. If the attributes are independent given the class, $p(E \mid C_i)$ can be decomposed into the product $p(v_1 \mid C_i) \cdot \ldots \cdot p(v_k \mid C_i)$. Domingos and Pazzani [9] show that this procedure has a surprisingly good performance in a wide variety of domains, including many where there are clear dependencies between attributes. In our reimplementation of this algorithm, the required probabilities are estimated from the training set. In the case of nominal attributes we use counts. Continuous attributes were discretized. This has been found to produce better results than assuming a Gaussian distribution [10, 9]. The number of bins used is a function of the number of different values observed on the training set: $k = \max(1;\ 2 \cdot \log(\text{nr. different values}))$. This heuristic was used in [10] and elsewhere with good overall results. Missing values were treated as another possible value for the attribute. In order to classify a query point, a naive Bayes uses all of the available attributes. Langley [17] notes that naive Bayes relies on an important assumption: that the variability of the dataset can be summarized by a single probabilistic description, and that this description is sufficient to distinguish between classes. From an analysis of bias-variance, this implies that naive Bayes uses a reduced set of models to fit the data. The result is low variance, but if the data cannot be adequately represented by the set of models, we obtain large bias.
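The bin heuristic and the naive Bayes decision rule can be illustrated with a short sketch. Two points are assumptions, since the text does not specify them: the logarithm is taken to be the natural logarithm, and the result is truncated to an integer. The function names and the numbers in the example are hypothetical.

```python
import math
from typing import Dict, List


def number_of_bins(values: List[float]) -> int:
    """k = max(1, 2 * log(number of distinct values)).

    Assumptions: natural logarithm, result truncated to an integer.
    """
    distinct = len(set(values))
    return max(1, int(2 * math.log(distinct)))


def posterior(priors: Dict[str, float],
              likelihoods: Dict[str, List[float]]) -> Dict[str, float]:
    """Normalize p(C_i) * prod_j p(v_j | C_i) over the classes."""
    joint = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
    evidence = sum(joint.values())  # plays the role of p(E)
    return {c: joint[c] / evidence for c in joint}


# Ten observations with eight distinct values.
bins = number_of_bins([1.0, 2.0, 2.0, 3.5, 4.0, 7.25, 9.0, 9.0, 10.0, 12.0])
post = posterior({"yes": 0.5, "no": 0.5},
                 {"yes": [0.8, 0.5], "no": [0.2, 0.5]})
```

The predicted class is the one with the largest entry in `post`; because the same evidence term divides every class, it does not affect which class is chosen.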

5.1.2  Linear  Discriminant 

A linear discriminant function is a linear composition of the attributes where the sum of squared differences between class means is maximal relative to the internal class variance. It is assumed that the attribute vectors for the examples of class $C_i$ are independent and follow a certain probability distribution with probability density function $f_i$. A new point with attribute vector $x$ is then assigned to the class for which the probability density function $f_i(x)$ is maximal. This means that the points for each class are distributed in a cluster centered at $\mu_i$. The boundary separating two classes is a hyperplane that passes through the midpoint of the two centers. If there are only two classes, a single hyperplane is needed to separate the classes. In the general case of $q$ classes, $q - 1$ hyperplanes are needed to separate them. By applying the linear discriminant procedure described below, we get $q_{node} - 1$ hyperplanes. The equation of each hyperplane is given by:

$H_i = \alpha_i + \sum_j \beta_{ij}\, x_j$ where

$\alpha_i = -\frac{1}{2}\, \mu_i^T S^{-1} \mu_i$ and $\beta_i = S^{-1} \mu_i$

We use a Singular Value Decomposition (SVD) to compute $S^{-1}$. SVD is numerically stable and is a tool for detecting sources of collinearity. This last aspect is used as a method for reducing the features of each linear combination. A linear discriminant uses all, or almost all, of the available attributes when classifying a query point. Breiman [2] notes that, from an analysis of bias-variance, the Linear Discriminant is a stable classifier, although it can fit only a small number of models. It achieves stability by having a limited set of models to fit the data. The result is low variance, but if the data cannot be adequately represented by the set of models, then we obtain large bias.

5.1.3  Decision  Tree 

Dtree is our version of a decision tree. It uses the standard algorithm to build a decision tree. The splitting criterion is the gain ratio. The stopping criterion is similar to that of C4.5. The pruning mechanism is similar to the pessimistic error of C4.5. Dtree uses a kind of smoothing process that usually improves the performance of tree-based classifiers. When classifying a new example, the example traverses the tree from the root to a leaf. In Dtree, the example is classified taking into account not only the class distribution at the leaf, but also all class distributions of the nodes in the path. That is, all nodes in the path contribute to the final classification. Instead of computing class distributions for all paths in the tree at classification time, as is done, for instance, in Buntine [6], Dtree computes a class distribution for all nodes when growing the tree. This is done recursively, taking into account class distributions at the current node and at the predecessor of the current node, using the formula:

$P(C_i \mid e_n, e) = P(C_i \mid e_n)\, \frac{P(e \mid e_n, C_i)}{P(e \mid e_n)}$

where $P(e \mid e_n)$ is the probability that one example that falls at $Node_n$ goes to $Node_{n+1}$, and $P(e \mid e_n, C_i)$ is the probability that one example from class $C_i$ goes from $Node_n$ to $Node_{n+1}$ [21]. This recursive formulation allows Dtree to compute efficiently the required class distributions on the fly. The smoothed class distributions have influence on the pruning mechanism and on the treatment of missing values. It is the most relevant difference from C4.5.
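The recursive update can be checked numerically with a small sketch that applies Bayes' rule to the class distribution of the predecessor node. The node counts below are invented for the illustration, and `child_distribution` is a hypothetical name, not a function of Dtree.

```python
from typing import Dict


def child_distribution(parent_dist: Dict[str, float],
                       p_branch_given_class: Dict[str, float],
                       p_branch: float) -> Dict[str, float]:
    """P(C_i | e_n, e) = P(C_i | e_n) * P(e | e_n, C_i) / P(e | e_n)."""
    return {c: parent_dist[c] * p_branch_given_class[c] / p_branch
            for c in parent_dist}


# Hypothetical node with 10 examples (6 of class "a", 4 of class "b");
# one branch receives 5 examples (4 of class "a", 1 of class "b").
parent = {"a": 6 / 10, "b": 4 / 10}
branch_given_class = {"a": 4 / 6, "b": 1 / 4}
branch = 5 / 10
child = child_distribution(parent, branch_given_class, branch)
```

With raw relative frequencies the update reproduces the class frequencies observed in the child (4/5 and 1/5 here); any smoothing effect depends on how the individual probabilities are estimated, which this section does not detail.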

A decision tree uses a subset of the available attributes to classify a query point. Kohavi and Wolpert [16] and Breiman [2, 3], among other researchers, note that decision trees are unstable classifiers. Small variations in the training set can cause large changes in the resulting predictors. They have high variance but they can fit any kind of data: the bias of a decision tree is low.

5.1.4 Local Cascade Generalization Algorithms

All the implemented Local Cascade Generalization algorithms are based on Dtree. That is, they use exactly the same splitting criteria, stopping criteria, pruning mechanism, etc. Moreover, they share many minor heuristics that individually are too small to mention, but collectively can make a difference.

At each decision node, CGLtree applies the Linear Discriminant described above, while CGBtree applies the naive Bayes algorithm. CGBLtree applies the Linear Discriminant to the ordered attributes and the naive Bayes to the categorical attributes. In order to prevent overfitting, the construction of new attributes is constrained to a depth of 5. In addition, the level of pruning is greater than the level of pruning in Dtree.

5.2  The  Datasets 

We have chosen 17 data sets from the UCI repository. All of them were previously used in other comparative studies. Evaluation was done using 10-fold stratified Cross Validation (CV). Datasets were permuted once before the CV procedure. All algorithms were used with the default settings. At each iteration of CV, all algorithms were trained on the same training partition of the data. Classifiers were also evaluated on the same test partition of the data. Comparisons between algorithms were performed using paired t-tests with the confidence level set at 95%.
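The comparison procedure can be sketched as a paired t-test over the ten per-fold error rates. The sketch assumes a two-sided test, which the text does not state; the per-fold error rates are invented for the illustration and are not results from this paper.

```python
import math
from typing import List


def paired_t_statistic(errors_a: List[float], errors_b: List[float]) -> float:
    """t statistic of the per-fold differences between two algorithms."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(variance / n)


# Two-sided critical value of Student's t for 9 degrees of freedom at 95%.
T_CRITICAL_9_DF = 2.262

# Hypothetical per-fold error rates (%) of two algorithms on the same 10 folds.
errors_a = [14.0, 15.5, 13.0, 16.0, 14.5, 15.0, 13.5, 16.5, 14.0, 15.0]
errors_b = [12.0, 13.0, 12.5, 13.5, 12.0, 13.0, 12.5, 14.0, 12.0, 13.5]
t = paired_t_statistic(errors_a, errors_b)
significant = abs(t) > T_CRITICAL_9_DF
```

Pairing is valid here because both algorithms are trained and tested on exactly the same partitions at each iteration of the cross validation.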

Table 1 presents the characteristics of the data sets, and the error rate and standard deviation of each base classifier. Relative to each algorithm, a + (-) sign in the first column means that the error rate of this algorithm is significantly better (worse) than that of Dtree. The error rate of C5.0 is presented for reference. These results provide evidence, once more, that no single algorithm is better overall.

5.3  Local  Cascade  Generalization 

Table 2a presents the results of local Cascade Generalization. Each column corresponds to a Cascade Generalization algorithm. Each algorithm is compared against its components using paired t-tests. For example, CGLtree is compared against Dtree and Discrim. A + (-) sign means that the error rate of the composite model is, with statistical significance, lower (higher) than that of the respective component model. The trend in these results shows a clear improvement over the base classifiers. We never observe a degradation of the error rate of a composite model in relation to all of its components. In some cases there is a significant increase in performance compared to all the components. For example, CGBLtree improves in 2 datasets over the 3 components, and in 5 datasets over 2 components.

Table 2b presents the results of C5.0 boosting with the default parameter of 10, that is, aggregating over 10 trees, and of Stacked Generalization as it is defined in [23]. That is, the level-0 classifiers are C4.5 and Bayes, and the level-1 classifier is Discrim. The attributes for the level-1 data are the probability class distributions, obtained from the level-0 classifiers using a 5-fold stratified cross validation. Both Boosting and Stacked are compared against CGBLtree, using paired t-tests with the confidence level set to 95%. A + (-) sign means that Boosting or Stacked performs significantly better (worse) than CGBLtree. In this study, CGBLtree performs significantly better than Stacked in 5 datasets and never performs worse. Compared with C5.0 Boosting, CGBLtree significantly improves in 4 datasets and loses in 3 datasets. The improvement observed with Boosting is mainly due to the reduction of the variance component of the error rate while, in Cascade algorithms, the improvement is mainly due to the reduction of the bias. We intend, in the near future, to boost CGBLtree.

Another dimension for comparison involves measuring the number of leaves. This corresponds to the number of different regions into which the instance space is partitioned by the algorithm. In almost all datasets^, any Cascade tree splits the instance space into half of the regions needed by Dtree or C5.0. This is a clear indication that Cascade models better capture the underlying structure of the data.

^Except on the Monks-2 dataset, where both Dtree and C5.0 produce a tree with only one leaf.

Dataset             Class  Nr. Ex.  Types            Dtree            C5.0             Bayes            Discrim
Australian          2      690      8 Ord, 6 Cont    14.37 ±6.18      13.63 ±4.36      15.07 ±3.76      14.05 ±5.23
Balance             3      625      4 Cont           21.91 ±4.63      21.92 ±4.93      30.08 ±7.01    + 13.14 ±2.46
Breast (W)          2      699      9 Ord             5.84 ±4.64       5.42 ±4.08    +  2.43 ±2.52    +  4.27 ±4.58
Diabetes            2      768      8 Cont           25.14 ±5.78      23.69 ±6.48      24.62 ±4.58      22.92 ±4.97
German              2      1000     17 Ord, 7 Cont   28.70 ±4.30      29.10 ±2.81      27.60 ±5.15    + 23.50 ±5.54
Glass               6      213      9 Cont           31.85 ±7.61      32.30 ±10.19     46.35 ±10.93   - 36.78 ±8.07
Heart               2      270      6 Ord, 7 Cont    25.16 ±9.84      22.96 ±8.69    + 15.93 ±8.56    + 15.93 ±4.29
Ionosphere          2      351      33 Cont           8.54 ±5.80       9.66 ±3.47      11.07 ±7.76    - 14.26 ±4.68
Iris                3      150      4 Cont            4.67 ±5.48       4.67 ±4.50       4.00 ±4.66       2.00 ±3.22
Monks-1             2      432      6 Ord             6.33 ±7.45    +  0.00 ±0.00      25.07 ±5.96    - 33.39 ±10.06
Monks-2             2      432      6 Ord            32.90 ±0.63      32.86 ±0.65      49.32 ±8.50      33.32 ±1.60
Monks-3             2      432      6 Ord             0.00 ±0.00       0.00 ±0.00       2.79 ±2.42    - 22.89 ±8.96
Satimage            6      6435     36 Cont          13.35 ±1.51      13.53 ±1.57      19.55 ±1.48    - 15.91 ±1.49
Segment             7      2310     18 Cont           3.64 ±1.13       3.38 ±1.34      10.22 ±0.74    -  8.18 ±0.83
Vehicle             4      846      18 Cont          28.11 ±4.87      27.27 ±5.48      37.70 ±2.18    + 22.34 ±2.87
Waveform            3      2581     21 Cont          23.38 ±3.40    - 24.88 ±2.94    + 18.52 ±2.24    + 15.15 ±1.86
Wine                3      178      13 Cont           6.66 ±6.32       7.19 ±7.44       2.22 ±3.88    +  0.56 ±1.76
Mean of error rate                                   16.50            16.02            20.15            17.56
Mean nr. Leaves                                      45.6             51.3

Table 1: Data Characteristics and Results of Base Classifiers

(a)
Dataset                CGLtree              CGBtree              CGBLtree
Australian             14.354 ±4.77         14.499 ±3.76         14.058 ±4.80
Balance          + +    7.016 ±2.68   + +    6.704 ±3.64   + + +  7.016 ±2.68
Breast (W)       +      3.280 ±2.59   +      2.712 ±2.27          3.280 ±2.68
Diabetes               23.565 ±3.12         26.693 ±5.87         23.565 ±3.12
German                 24.700 ±4.19         27.100 ±5.48   -     25.300 ±5.25
Glass                  33.866 ±9.26   +     27.004 ±7.51   +     33.866 ±9.26
Heart            +     17.037 ±5.58   +     16.667 ±6.11   +     17.037 ±5.58
Ionosphere             11.363 ±4.32          9.369 ±5.12         11.363 ±4.32
Iris                    2.667 ±3.44          4.000 ±4.66          2.667 ±3.44
Monks-1                 2.976 ±4.43   +     14.372 ±8.69   + +    2.565 ±3.77
Monks-2                33.335 ±5.81   + +   13.874 ±6.97   + + + 11.120 ±5.36
Monks-3          +      0.698 ±1.12   +      0.465 ±0.47   + +    0.465 ±1.47
Satimage         +     12.385 ±1.44   + +   11.673 ±1.25   + +   12.385 ±1.44
Segment          +      3.853 ±1.22   +      4.416 ±1.47   + +    3.853 ±1.21
Vehicle          +     21.025 ±3.08   +     28.844 ±3.88   + +   21.025 ±3.08
Waveform         +     16.351 ±1.68   +     16.004 ±2.78   +     16.351 ±1.68
Wine             +      0.556 ±1.76          3.403 ±3.94   +      0.556 ±1.76
Mean error rate        13.47                13.40                12.15
Mean nr. leaves        23.9                 23.7                 22.9

(b)
Dataset            C5Boost            Stacked
Australian         13.337 ±3.33       13.766 ±4.47
Balance          - 20.184 ±4.17     - 12.309 ±3.63
Breast (W)          3.135 ±3.20        2.427 ±2.52
Diabetes           24.728 ±5.46       22.657 ±5.42
German             23.200 ±2.35       24.800 ±4.24
Glass              25.020 ±10.09      35.753 ±6.20
Heart              19.630 ±9.25       16.667 ±8.24
Ionosphere       +  5.947 ±3.06       10.758 ±7.33
Iris             -  5.333 ±4.22        4.667 ±3.22
Monks-1             0.000 ±0.00        0.682 ±2.16
Monks-2          - 36.353 ±5.87     - 32.865 ±0.65
Monks-3             0.000 ±0.00     -  2.072 ±2.01
Satimage         +  9.062 ±1.07     - 13.303 ±1.63
Segment          +  1.905 ±1.05        3.420 ±1.35
Vehicle          - 24.922 ±3.71     - 27.731 ±5.06
Waveform           17.980 ±1.86       16.429 ±1.50
Wine                2.222 ±2.87        2.778 ±3.93
Mean error rate    13.70              14.30

Table 2: Results of (a) Local Cascade Generalization (b) Boosting and Stacked

6 Conclusions

This paper presents a new methodology for classifier combination. The basic idea of Cascade Generalization consists of a reformulation of the input space by means of insertion of new attributes. A base classifier computes the new attributes. Each new attribute is the instantiation of $P(C_i \mid x)$ given by the predictor function generated by the base classifier on this example. In this sense, the new attributes are terms, or functions, in the representational language of the base classifier. This constructive step acts as a way of extending the description language of the high level classifiers. The number of new attributes is equal to the number of classes, and for each example, they are computed as the conditional probability of the example belonging to $class_i$ given by the base classifier.

Cascade Generalization can be applied locally by any learning algorithm that uses a divide-and-conquer strategy. As pointed out by several researchers, successful combination of classifiers requires different syntactic models. We have chosen, for the implementation of Local Cascade Generalization algorithms, three algorithms that have very different behavior from a bias-variance analysis: as high level classifier we use a decision tree, and as low level classifier we use a naive Bayes, giving CGBtree, and a Linear Discriminant, giving CGLtree. At each decision node a constructive step is performed by applying the base classifier. The new axis incorporates new knowledge provided by the base classifiers. The bias restriction imposed by using single model classes is relaxed in the directions given by the base classifiers. It is this kind of synergy among classifiers that Cascade exploits.

There are two main issues that differentiate Cascade from previous methods based on multiple models. The first one is related to its ability to be applied locally, merging different models. We get a single model whose components are terms of the base model language, extending the high level model language. Cascade gives a single structured model for the data, and in this way is better suited to capture insights about problem structure. The second point is related to the use of probability class distributions. Using these probabilities allows the system to use information about the strength of the classifier. This is very useful information, particularly when combining predictions of classifiers. We have shown that this methodology can improve the accuracy of the base classifiers, competing well with other methods for combining classifiers, while preserving the ability to provide a single, albeit structured, model for the data.

Abstract

Many reinforcement learning algorithms, like Q-Learning or R-Learning, correspond to adaptive methods for solving Markovian decision problems in infinite horizon when no model is available. In this article we consider the particular framework of non-stationary finite-horizon Markov Decision Processes. After establishing a relationship between the finite-horizon total reward criterion and the average-reward criterion in finite horizon, we define $Q_H$-Learning and $R_H$-Learning for finite-horizon MDPs. Then we introduce the Ordinary Differential Equation (ODE) method to conduct a learning rate analysis of $Q_H$-Learning and $R_H$-Learning. $R_H$-Learning appears to be a version of $Q_H$-Learning with matrix-valued stepsizes, the corresponding gain matrix being very close to the optimal matrix which results from the ODE analysis. Experimental results confirm this performance hierarchy.

1  Introduction 

The search for optimal policies in Markov Decision Processes has been deeply studied according to different optimality criteria and has led to the definition of the well known Bellman optimality equations and of dynamic programming algorithms [Puterman, 1994]. Most Reinforcement Learning (RL) algorithms that have been recently developed [Kaelbling et al., 1996, Bertsekas and Tsitsiklis, 1996] take a stochastic optimization approach to solve these optimality equations, by directly learning the optimal policies from iterated observations of rewards and state transitions, without a priori knowledge about the system.

In this paper we consider the case of non-stationary Markov decision problems in a finite horizon. Despite being an accurate model of many applications concerning the management of industrial production systems, finite-horizon MDPs have not yet been specifically considered in reinforcement learning. This article is a first attempt to fill this gap.

Our work consists of two parts. First, we propose a reformulation of the two main classical optimality criteria, the expected total reward criterion and the average expected reward criterion, given the finite-horizon assumption. After establishing an equivalence between them, we conclude that it is possible to use the two adapted reinforcement learning algorithms, $Q_H$-Learning and $R_H$-Learning, to learn optimal policies for non-stationary finite-horizon MDPs.

Secondly, we conduct an analysis of the respective rates of convergence of $Q_H$-Learning and $R_H$-Learning. Surprisingly, $R_H$-Learning appears to be a version of $Q_H$-Learning with matrix-valued stepsizes. Furthermore, the ordinary differential equation (ODE) method makes it possible to determine a theoretical optimal matrix-valued gain, and it appears that the gain corresponding to $R_H$-Learning is numerically and structurally very close to that optimal gain. The experimental study we conducted confirms these results: in most situations we tested, $R_H$-Learning performs better than $Q_H$-Learning, and the implementation of the optimal matrix-valued gain defines a reinforcement algorithm that surpasses $R_H$-Learning.

2 Reinforcement Learning in Finite-Horizon

2.1  Non-Stationary  MDP  in  Finite-Horizon 

The majority of reinforcement learning algorithms solve stationary infinite-horizon Markov Decision Problems. Given a state-space $S$ and an action-space $A$, the dynamics of a Markov decision process are characterized as follows: at each time step $t \in T$, the execution of action $a_t \in A$ in state $x_t \in S$ leads to the new state $x_{t+1} \in S$ with a probability $p(x_{t+1} \mid x_t, a_t)$, and to the instantaneous reward $r(x_t, a_t)$.

A Markov Decision Problem is defined by adding to that process a performance criterion to maximize over a set of decisional policies. This criterion is a measure of the expected sum of the rewards along a trajectory, and policies are functions that indicate the action $a_t$ to execute given information about the past trajectory at time $t$. For stationary infinite-horizon Markov decision problems, most of the performance criteria lead to the existence of stationary optimal policies, i.e. functions $\pi$ that map states in $S$ to actions in $A$.

In finite-horizon problems, trajectories are sequences of
exactly $N$ transitions, with $T = \{1, \ldots, N\}$. The
performance criterion considered in that case is the finite
total expected reward criterion
$$V^{\pi}(x) = E_{\pi}\left[ r(x_1,a_1) + r(x_2,a_2) + \cdots + r(x_N,a_N) \mid x_1 = x \right]$$
where $x \in S$ and $E_{\pi}$ is the expected value given the
policy $\pi$.

When dealing with finite-horizon MDPs, the stationarity
assumption can no longer be made. A first reason is that even
for stationary finite-horizon MDP models (time-independent
spaces $S$ and $A$, transition probabilities $p()$ and rewards
$r()$), optimal policies are no longer stationary, and are
functions of $T \times S$ into $A$: $(t, x) \mapsto \pi(t, x)$
[Puterman, 1994].

More practically, in most of the problems of industrial
production process control that lead to finite-horizon MDPs, the
main cause of non-stationarity of optimal policies is the
non-stationarity of the MDP model itself: it is very common to
have different state-spaces, decision-spaces, transition
probabilities and reward values at each decision step. In order
to take this characteristic into account, we consider the
following formal model of finite-horizon MDP: 1) to each time
step $i \in T = \{1, 2, \ldots, N\}$ is associated a finite state
space $S_i$ and a finite decision space $A_i$; 2) for each step
$i \in \{1, 2, \ldots, N-1\}$, the execution of action
$a_i \in A_i$ from the state $x_i \in S_i$ leads to the new state
$x_{i+1} \in S_{i+1}$ with probability $p_i(x_{i+1} \mid x_i, a_i)$,
and with the instantaneous reward $r_i(x_i, a_i)$; 3) at the last
decision step, the system receives a reward $r_N(x_N, a_N)$ after
the execution of $a_N$ in $x_N$, and stops.

For this particular kind of MDP a policy $\pi$ can be decomposed
into a set $\{\pi_1, \pi_2, \ldots, \pi_N\}$ of policies
$\pi_i : S_i \to A_i$. For each decision step $i$, a value
function associated to $\pi$ is defined as
$$V_i^{\pi}(x) = E_{\pi}\left[ \sum_{t=i}^{N} r_t(x_t, \pi_t(x_t)) \mid x_i = x \right].$$

We say a policy $\pi$ is optimal if it maximizes the value
function $V_1^{\pi}$ on $S_1$. For this criterion, the classical
Bellman optimality equations that characterize optimal policies
are
$$V_i^*(x) = \max_{a \in A_i} \left\{ r_i(x,a) + \sum_{y \in S_{i+1}} p_i(y \mid x,a)\, V_{i+1}^*(y) \right\} \qquad (1)$$
for all $x \in S_i$, $i \in \{1, \ldots, N\}$, and $V_{N+1}^* = 0$
[Puterman, 1994]. Then
$\pi_i^*(x) = \arg\max_a \{ r_i(x,a) + \sum_{y \in S_{i+1}} p_i(y \mid x,a)\, V_{i+1}^*(y) \}$.
This optimality equation has a single solution
$V^* = (V_1^*, \ldots, V_N^*)$ that can be easily obtained by a
dynamic programming algorithm in $O(N \cdot n_A \cdot n_S^2)$
complexity (for state spaces $S_i$ and decision spaces $A_i$ of
constant size $n_S$ and $n_A$) when transition probabilities and
reward function are known [Puterman, 1994]. The associated
learning problem is to adaptively estimate the optimal value
functions $V^*$ and the corresponding policies $\pi^*$, from
observed transitions and rewards when the Markov decision
process is not known.
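As an illustration of the dynamic programming step, the backward recursion of equation (1) can be sketched as follows. This is only a minimal sketch under stated assumptions: the model is stored in nested Python lists, steps are indexed from 0, and the function name `backward_dp` and the toy numbers are hypothetical and do not come from the paper.

```python
def backward_dp(p, r):
    """Solve equation (1) by backward induction (illustrative sketch).

    r[i][x][a]    : reward of action a in state x at step i (0-based step index)
    p[i][x][a][y] : probability of reaching state y of step i+1
                    (only needed for i < N-1; nothing follows the last step)
    Returns (V, policy): V[i][x] is the optimal value, policy[i][x] a greedy action.
    """
    N = len(r)
    V = [[0.0] * len(r[i]) for i in range(N)]
    policy = [[0] * len(r[i]) for i in range(N)]
    for i in reversed(range(N)):
        for x in range(len(r[i])):
            best_a, best_q = 0, float("-inf")
            for a in range(len(r[i][x])):
                q = r[i][x][a]
                if i + 1 < N:  # V_{N+1}^* = 0 after the last decision step
                    q += sum(p[i][x][a][y] * V[i + 1][y]
                             for y in range(len(V[i + 1])))
                if q > best_q:
                    best_a, best_q = a, q
            V[i][x], policy[i][x] = best_q, best_a
    return V, policy


# Toy two-step problem (hypothetical numbers): 2 states and 2 actions per step.
r = [
    [[0.0, 1.0], [0.5, 0.0]],  # step 1
    [[1.0, 0.0], [0.0, 2.0]],  # step 2 (last step)
]
p = [
    [[[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 -> state 0, action 1 -> state 1
     [[0.5, 0.5], [1.0, 0.0]]],  # state 1
]
V, policy = backward_dp(p, r)
```

The three nested loops over steps, states and actions, together with the inner sum over successor states, give the $O(N \cdot n_A \cdot n_S^2)$ cost quoted above.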

2.2  $Q_{\mathcal{H}}$-Learning in Finite-Horizon

Q-Learning [Watkins, 1989] is based on the rewriting of the
Bellman optimality equation, replacing the value function
$V^{\pi}(x)$ of a policy by a new function $Q^{\pi}(x,a)$: for all
$x \in S_i$, $a \in A_i$,
$$Q_i^{\pi}(x,a) = r_i(x,a) + \sum_{y \in S_{i+1}} p_i(y \mid x,a)\, V_{i+1}^{\pi}(y).$$
We have $V_i^{\pi}(x) = Q_i^{\pi}(x, \pi_i(x))$, and the optimality
equation becomes
$$Q_i^*(x,a) = r_i(x,a) + \sum_{y \in S_{i+1}} p_i(y \mid x,a) \max_{b \in A_{i+1}} Q_{i+1}^*(y,b),$$
for all $x \in S_i$ and $a \in A_i$. Then
$V_i^*(x) = \max_a Q_i^*(x,a)$ and $\pi_i^*(x) = \arg\max_a Q_i^*(x,a)$.

Q-Learning is a reinforcement learning algorithm allowing the
iterative generation of the solution $Q^*$ and the optimal
policies $\pi^*$. The algorithm consists in updating at each
iteration $n$ the estimate $Q_n$ of the value function $Q^*$,
from the current observed transition and reward
$\langle x_n, a_n, y_n, r_n \rangle$. Q-Learning is a natural
candidate for solving finite-horizon MDPs. It is indeed easy to
transform an $N$-step non-stationary MDP into an infinite-horizon
process, by adding an artificial final absorbing state $x_{abs}$
that is reward-free and such that all actions in $A_N$ lead with
probability 1 to $x_{abs}$ (figure 1).


Figure 1: infinite process with absorbing state (state spaces $S_1$, $S_2$, ..., $S_N$)

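Hence the first reinforcement learning algorithm we propose for
finite-horizon MDPs is:

Finite-Horizon $Q_{\mathcal{H}}$-Learning

Observe $\langle x_n, a_n, y_n, r_n \rangle$

Update
$$Q_{n+1}(x,a) = Q_n(x,a) + \alpha_n(x,a)\, e_n \qquad (2)$$
with
$$e_n = \begin{cases}
r_n + \max_b Q_n(y_n,b) - Q_n(x,a) & \text{if } (x,a) = (x_n,a_n),\ x_n \in S_i,\ i < N \\
r_n - Q_n(x,a) & \text{if } (x,a) = (x_n,a_n),\ x_n \in S_N \\
0 & \text{otherwise}
\end{cases}$$

If $x_n \notin S_N$ set $x_{n+1} = y_n$;
otherwise choose randomly $x_{n+1}$ in $S_1$.

If $x_{n+1} \in S_j$ select $a_{n+1}$ in $A_j$.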

In this algorithm $Q_0(x,a) = 0$ and the $\alpha_n(x,a)$ are small
learning rates decaying over time. The state exploration is
classically determined by the dynamics of the process (that is,
$x_{n+1} = y_n$), until the last decision step is reached and we
restart a new trajectory by choosing randomly a new initial
state in $S_1$. The specific learning rule for $S_N$ is equivalent
to directly setting
$V_{N+1}^*(x_{abs}) = Q_{N+1}^*(x_{abs}, a_{loop}) = 0$. The action
selection is as usual based on an exploration function.
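The update rule (2) and the restart mechanism can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: it assumes a tabular model stored in nested lists, rewards that are deterministic functions of $(x, a)$, a purely uniform exploration function, and the learning rate $\alpha_n(x,a) = 1/N(x,a)$; the function name `qh_learning` and the toy numbers are hypothetical.

```python
import random


def qh_learning(p, r, n_iter, seed=0):
    """Tabular finite-horizon Q-learning following update rule (2) (sketch).

    r[i][x][a]    : reward of action a in state x at step i (0-based)
    p[i][x][a][y] : transition probabilities from step i to step i+1
    """
    rng = random.Random(seed)
    N = len(r)
    Q = [[[0.0] * len(r[i][x]) for x in range(len(r[i]))] for i in range(N)]
    visits = [[[0] * len(r[i][x]) for x in range(len(r[i]))] for i in range(N)]
    i, x = 0, rng.randrange(len(r[0]))             # random initial state in S_1
    for _ in range(n_iter):
        a = rng.randrange(len(r[i][x]))            # uniform exploration (assumption)
        visits[i][x][a] += 1
        alpha = 1.0 / visits[i][x][a]              # decaying learning rate (assumption)
        if i < N - 1:
            y = rng.choices(range(len(r[i + 1])), weights=p[i][x][a])[0]
            error = r[i][x][a] + max(Q[i + 1][y]) - Q[i][x][a]
            Q[i][x][a] += alpha * error
            i, x = i + 1, y                        # follow the process dynamics
        else:
            error = r[i][x][a] - Q[i][x][a]        # specific rule for the last step
            Q[i][x][a] += alpha * error
            i, x = 0, rng.randrange(len(r[0]))     # restart a new trajectory in S_1
    return Q


# Toy two-step problem (hypothetical numbers): 2 states and 2 actions per step.
r = [[[0.0, 1.0], [0.5, 0.0]], [[1.0, 0.0], [0.0, 2.0]]]
p = [[[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]]
Q = qh_learning(p, r, n_iter=40000)
greedy = [[max(range(len(qs)), key=lambda a: qs[a]) for qs in layer] for layer in Q]
```

On this toy problem the greedy policy read off `Q` can be compared with the one produced by backward dynamic programming on the same model.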

Let us assume that each pair $(x,a)$ in $S_i \times A_i$ is
visited an infinite number of times, and that
$\sum_n \alpha_n(x,a) = \infty$ and $\sum_n \alpha_n^2(x,a) < \infty$.
The convergence of Q-Learning [Watkins and Dayan, 1992,
Jaakkola et al., 1994, Tsitsiklis, 1994] in the undiscounted
case ($\gamma = 1$) with reward-free absorbing states proves that
this finite-horizon $Q_{\mathcal{H}}$-Learning algorithm converges
with probability 1 towards the optimal value function:
$\forall x \in S_i$, $a \in A_i$,
$\lim_{n \to \infty} Q_n(x,a) = Q_i^*(x,a)$ a.s., with
$V_i^*(x) = \max_{a \in A_i} Q_i^*(x,a)$.

2.3  $R_{\mathcal{H}}$-Learning and the average-reward criterion

The average reward criterion was introduced in Reinforcement
Learning by Schwartz through the R-Learning algorithm
[Schwartz, 1993]. It has been studied since then by many
researchers [Singh, 1994, Ok and Tadepalli, 1996,
Mahadevan, 1996b]. The goal is to search for gain-optimal
policies that maximize the expected payoff per step, which is a
very natural measure of optimal acting:
$$\rho^{\pi}(x) = \lim_{n \to \infty} E_{\pi}\left[ \frac{1}{n} \sum_{t=1}^{n} r_t \mid x_1 = x \right].$$

For the particular case of unichain MDPs (that is, for every
policy $\pi$, the Markov chain $\{x_n\}_n$ contains a single
recurrent class of states, and a possibly empty set of transient
states), the average reward associated to each policy is
independent of the state: $\rho^{\pi}(x) = \rho^{\pi}(y) = \rho^{\pi}$.
For simplicity, most of the results concerning the average
reward criterion in Reinforcement Learning have been established
with this unichain assumption [Mahadevan, 1996b].

A more selective optimality criterion can be defined. It is
based on a new value function of a policy $\pi$, called bias
value [Puterman, 1994]. For every state $x \in S$ we have
$$U^{\pi}(x) = \lim_{n \to \infty} E_{\pi}\left[ \sum_{t=1}^{n} (r_t - \rho^{\pi}) \mid x_1 = x \right].$$
A policy $\pi^*$ is said to be bias-optimal (or T-optimal in
[Schwartz, 1993]) if it is gain-optimal, and if
$U^{\pi^*}(x) \geq U^{\pi}(x)$ for all $x$ and every policy $\pi$.

The existence of optimal stationary policies for gain and bias
optimality has been shown [Puterman, 1994]. For all unichain
MDPs, there exists a pair $(U^*, \rho^*)$ solution of the Bellman
equation for the average criterion:
$$U^*(x) + \rho^* = \max_a \left( r(x,a) + \sum_y p(y \mid x,a)\, U^*(y) \right), \qquad (3)$$
for all $x \in S$, such that the average reward of the policy
$\pi^*$ that maximizes the right-hand side of (3) is the optimal
average reward $\rho^*$. Furthermore, if $(U^{*\prime}, \rho^{*\prime})$
is also a solution of (3), then $\rho^* = \rho^{*\prime}$. The
solutions $U^*$ of (3) are not unique, since for each solution
$(U^*, \rho^*)$ and each constant $k$, the pair $(U^* + k, \rho^*)$
is also a solution (one can show that this is a complete
characterization of the set of solutions for unichain MDP models
[Puterman, 1994]).

That last remark shows that (3) is not sufficient to produce
bias-optimal policies. Another optimality equation, based on a
third notion of value function called bias offset, is generally
required [Puterman, 1994, Mahadevan, 1996a].


In order to adapt the average-reward criterion to finite-horizon
MDPs, we first transform the initial process $S_1 \to S_N$ into a
new infinite process $S_1 \to S_N \circlearrowleft S_1$. The
natural solution we propose is to artificially close the loop
between $S_N$ and $S_1$ by adding a uniform transition:
$\forall x \in S_N$, $\forall a \in A_N$, $\forall y \in S_1$,
$p_N(y \mid x,a) = 1/|S_1|$ (figure 2).

Figure 2: infinite process with looping on $S_1$

For the new MDP $S_1 \to S_N \circlearrowleft S_1$ we proved the
following proposition [Garcia and Ndiaye, 1998]:

Proposition 1  For the cycling process
$S_1 \to S_N \circlearrowleft S_1$, for every policy $\pi$,
$$\forall x \in S_i,\quad \rho^{\pi}(x) = \frac{1}{N\,|S_1|} \sum_{x_1 \in S_1} V_1^{\pi}(x_1) = \rho^{\pi},$$
$$\forall x \in S_i,\ i = 1, \ldots, N,\quad U_i^{\pi}(x) = V_i^{\pi}(x) - (N - i + 1)\,\rho^{\pi},$$
$$\sum_{x_1 \in S_1} U_1^{\pi}(x_1) = 0.$$

The first aspect of this result, the state independence of
$\rho^{\pi}$, is not surprising since the looping
$S_N \circlearrowleft S_1$ transforms the original MDP into a
unichain process. More interesting are the next equalities: the
second one establishes an equivalence between the bias-value
function, the average reward and the value function in
finite-horizon, and the last one completely determines this
bias-value function. From these properties we proved the
following theorem [Garcia and Ndiaye, 1998]:

Theorem 1  If $(U^*, \rho^*)$ is a solution of the average-reward
Bellman equation (3) for $S_1 \to S_N \circlearrowleft S_1$ with
the constraint $\sum_{x_1 \in S_1} U_1^*(x_1) = 0$, and $\pi^*$ is
an associated gain-optimal policy, then the value functions
$V_i^*(x) = U_i^*(x) + (N - i + 1)\,\rho^*$ are solutions of the
finite-horizon Bellman equation (1), and $\pi^*$ is a policy that
maximizes $V_i^{\pi}(x)$ for $x \in S_i$, $i = 1, \ldots, N$.

That result shows that there is an equivalence between the
finite-horizon and average-reward criteria, and a solution of
(3) necessarily leads to a solution of (1). The following
corollary characterizes this equivalence more deeply
[Garcia and Ndiaye, 1998].

Corollary 1  If $(U^*, \rho^*)$ is a solution of (3) for
$S_1 \to S_N \circlearrowleft S_1$ with
$\sum_{x_1 \in S_1} U_1^*(x_1) = 0$, then $(U^*, \rho^*)$ also
defines a bias-optimal solution.

From these results, it appears natural to use R-Learning for
solving finite-horizon MDPs. The second reinforcement learning
algorithm we propose, called $R_{\mathcal{H}}$-Learning, is an
adaptation of R-Learning with an update rule for states in $S_N$
that directly integrates the final condition
$R_N^*(x,a) = r_N(x,a) - \rho^*$ for $x \in S_N$, $a \in A_N$:


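Finite-Horizon $R_{\mathcal{H}}$-Learning

Observe $\langle x_n, a_n, y_n, r_n \rangle$

Update
$$R_{n+1}(x,a) = R_n(x,a) + \alpha_n(x,a)\, e_n$$
$$\rho_{n+1} = \rho_n + \beta_n\, e'_n$$
with
$$e_n = \begin{cases}
r_n - \rho_n + \max_b R_n(y_n,b) - R_n(x,a) & \text{if } (x,a) = (x_n,a_n),\ x_n \in S_i,\ i < N \\
r_n - \rho_n - R_n(x,a) & \text{if } (x,a) = (x_n,a_n),\ x_n \in S_N \\
0 & \text{otherwise}
\end{cases}$$
$$e'_n = \begin{cases}
r_n - \rho_n + \max_b R_n(y_n,b) - R_n(x_n,a_n) & \text{if } x_n \in S_i,\ i < N,\ a_n = \pi_n(x_n) \\
r_n - \rho_n - R_n(x_n,a_n) & \text{if } x_n \in S_N,\ a_n = \pi_n(x_n) \\
0 & \text{otherwise}
\end{cases}$$

If $x_{n+1} = y_n \in S_j$ select $a_{n+1}$ in $A_j$.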
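The coupled iteration on the action values and on the average-reward estimate can be sketched in the same tabular style. Again this is only a sketch under assumptions that are not taken from the paper: deterministic rewards, uniform exploration, $\alpha_n(x,a) = \alpha_0 / N(x,a)$, $\beta_n = \beta_0 / n$, ties counted as greedy, and hypothetical names (`rh_learning`, the toy model).

```python
import random


def rh_learning(p, r, n_iter, alpha0=1.0, beta0=0.4, seed=0):
    """Tabular finite-horizon R-learning with a uniform loop back to S_1 (sketch).

    r[i][x][a]    : reward of action a in state x at step i (0-based)
    p[i][x][a][y] : transition probabilities from step i to step i+1
    Returns (R, rho): relative action values and the average-reward estimate.
    """
    rng = random.Random(seed)
    N = len(r)
    R = [[[0.0] * len(r[i][x]) for x in range(len(r[i]))] for i in range(N)]
    visits = [[[0] * len(r[i][x]) for x in range(len(r[i]))] for i in range(N)]
    rho = 0.0
    i, x = 0, rng.randrange(len(r[0]))
    for n in range(1, n_iter + 1):
        a = rng.randrange(len(r[i][x]))            # uniform exploration (assumption)
        greedy = R[i][x][a] == max(R[i][x])        # is a_n a greedy action in x_n?
        visits[i][x][a] += 1
        alpha = alpha0 / visits[i][x][a]           # assumed schedule alpha_0 / N(x,a)
        beta = beta0 / n                           # assumed schedule beta_0 / n
        if i < N - 1:
            y = rng.choices(range(len(r[i + 1])), weights=p[i][x][a])[0]
            error = r[i][x][a] - rho + max(R[i + 1][y]) - R[i][x][a]
            nxt = (i + 1, y)
        else:
            error = r[i][x][a] - rho - R[i][x][a]  # final condition on S_N
            nxt = (0, rng.randrange(len(r[0])))    # uniform transition back to S_1
        R[i][x][a] += alpha * error
        if greedy:
            rho += beta * error                    # rho moves only on greedy actions
        i, x = nxt
    return R, rho


# Toy two-step problem (hypothetical numbers): 2 states and 2 actions per step.
r = [[[0.0, 1.0], [0.5, 0.0]], [[1.0, 0.0], [0.0, 2.0]]]
p = [[[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]]
R, rho = rh_learning(p, r, n_iter=40000)
greedy = [[max(range(len(rs)), key=lambda a: rs[a]) for rs in layer] for layer in R]
```

Only the greedy policy read off `R` is compared with the dynamic programming solution here; no convergence guarantee is claimed for this sketch.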
3  A learning rate analysis of $Q_{\mathcal{H}}$-Learning and $R_{\mathcal{H}}$-Learning

The simulations we conducted from a random finite-MDP generator
(see [Garcia and Ndiaye, 1998] and section 4) have shown
experimentally that $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning always converge to an optimal policy,
and that $R_{\mathcal{H}}$-Learning is most of the time faster than
$Q_{\mathcal{H}}$-Learning. However, there are still no definitive
theoretical results about the convergence of R-Learning-like
algorithms, with a mixed iteration on $R_n$ and $\rho_n$.

The aim of this section is to introduce a comparison of the
respective rates of convergence of $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning. The analysis we propose below is made
possible by an original equivalent transformation of
$R_{\mathcal{H}}$-Learning into a new reinforcement algorithm, the
form of which is closer to $Q_{\mathcal{H}}$-Learning. We first
present that transformation.


3.1  An equivalent formulation of $R_{\mathcal{H}}$-Learning

Consider the second equation of Proposition 1. It sets a direct
relation between the value functions $V^{\pi}$ and $U^{\pi}$ of a
policy $\pi$, which can be directly translated in terms of the
functions $R^{\pi}$ and $Q^{\pi}$: $\forall x \in S_i$,
$\forall a \in A_i$,
$R_i^{\pi}(x,a) = Q_i^{\pi}(x,a) - (N - i + 1)\,\rho^{\pi}$. From that
observation we propose to replace the iteration on $R_n$ in the
$R_{\mathcal{H}}$-Learning algorithm by an iteration on $Q_n$. With
this aim, we define the new series $\{Q_n\}_n$:
$$\forall x \in S_i,\ \forall a \in A_i,\quad Q_n(x,a) = R_n(x,a) + (N - i + 1)\,\rho_n \qquad (4)$$
with $\{R_n\}_n$ and $\{\rho_n\}_n$ the two series of the
$R_{\mathcal{H}}$-Learning algorithm. That transformation leads to
the following equivalent reinforcement algorithm:

Finite-Horizon $R_{\mathcal{H}}$-Learning - Q formulation

$$Q_{n+1}(x,a) = Q_n(x,a) + \gamma_n(x,a)\, e_n \qquad (5)$$
$$\rho_{n+1} = \rho_n + \beta_n\, e_n \quad \text{if } a_n = \pi_n(x_n)$$
with
$$e_n = \begin{cases}
r_n + \max_b Q_n(y_n,b) - Q_n(x_n,a_n) & \text{if } x_n \in S_i,\ i < N \\
r_n - Q_n(x_n,a_n) & \text{if } x_n \in S_N
\end{cases}$$
$$\gamma_n(x,a) = \begin{cases}
\alpha_n(x,a) + (N - i + 1)\,\beta_n & \text{if } (x,a) = (x_n,a_n),\ x \in S_i,\ a_n = \pi_n(x_n) \\
\alpha_n(x,a) & \text{if } (x,a) = (x_n,a_n),\ a_n \neq \pi_n(x_n) \\
(N - i + 1)\,\beta_n & \text{if } (x,a) \neq (x_n,a_n),\ x \in S_i,\ a_n = \pi_n(x_n) \\
0 & \text{otherwise}
\end{cases}$$

As we can see, the two series $\{Q_n\}_n$ and $\{\rho_n\}_n$ are
now decoupled. Furthermore, since
$\pi_n(x) = \arg\max_a R_n(x,a) = \arg\max_a (Q_n(x,a) - (N - i + 1)\,\rho_n) = \arg\max_a Q_n(x,a)$,
the $\{\rho_n\}_n$ iteration is not even necessary to determine
the current policy $\pi_n$.

Hence the two algorithms $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning in finite-horizon can be considered as
two different updating rules of the same value function $Q$.
More precisely, the main difference between
$Q_{\mathcal{H}}$-Learning and $R_{\mathcal{H}}$-Learning can now be
clearly associated to the number of components $Q_n(x,a)$ which
are modified at each iteration of the algorithm. In
$Q_{\mathcal{H}}$-Learning, we only update the component
$Q_n(x_n,a_n)$. In $R_{\mathcal{H}}$-Learning, if
$(x,a) \neq (x_n,a_n)$, $Q(x,a)$ can still be updated if the
action $a_n$ corresponds to a greedy action for the state $x_n$
in the policy $\pi_n$.

3.2  Reinforcement Learning and the ODE method


Now that we have seen that $R_{\mathcal{H}}$-Learning is a parallel
version of $Q_{\mathcal{H}}$-Learning, we intend to compare their
respective rates of convergence. The theoretical tool we have
chosen is the Ordinary Differential Equation (ODE) method
recently introduced in reinforcement learning
[Bertsekas and Tsitsiklis, 1996, Kushner and Yin, 1997]. The ODE
method results from the combination of dynamical systems and
stochastic approximation techniques. The classical theory of
stochastic approximation introduced by Robbins and Monro
[Robbins and Monro, 1951] concerns the analysis of adaptive
stochastic algorithms
$$\theta_{n+1} = \theta_n + \gamma_n\, H(\theta_n, X_{n+1}) \qquad (6)$$
where $\theta_n$ is the parameter vector, and $X_n$ the input
random vector bringing some information on $\theta$ at time $n$.
The application of this theory to the domain of Reinforcement
Learning has led to general proofs of convergence for Q-Learning
or TD($\lambda$) [Jaakkola et al., 1994, Tsitsiklis, 1994]. The
ODE method was initially proposed by Ljung [Ljung, 1977], and
has since been the source of many works, as in
[Kushner and Clark, 1978, Benaïm, 1996]. It consists in the
introduction of the averaged differential equation
$\frac{d\theta}{dt} = \bar{H}(\theta)$ where
$\bar{H}(\theta) = \lim_{n \to \infty} E[H(\theta, X_n)]$, the
behaviour of which can be compared to the asymptotic behaviour
of (6).

The use of the ODE method for analysing learning algorithms like
neural nets was originally introduced by Benaïm [Benaïm, 1995].
An application to the analysis of reinforcement learning
algorithms has already been considered in
[Bertsekas and Tsitsiklis, 1996, Kushner and Yin, 1997], where
convergence analyses of Q-Learning are presented. The point we
want to emphasize in this article is that the ODE method can
also be applied to study the learning rates of reinforcement
learning algorithms like $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning.

The representation we adopt for this study is the following: the
parameter vector $\theta$ to estimate is the optimal value
function $Q^*$, $X_n$ represents the observation at time $n$ and
is defined as $X_n = (x_{n-1}, a_{n-1}, x_n)$, and $H()$ is the
update rule of $Q_{\mathcal{H}}$-Learning in finite-horizon. Here
$H(Q, (x,a,y))$ is set to the vector whose components are all
zero except the one indexed by $(x,a)$, which is
$$\begin{cases}
r(x,a) + \max_b Q(y,b) - Q(x,a) & \text{if } x \in S_i,\ i < N \\
r(x,a) - Q(x,a) & \text{if } x \in S_N
\end{cases}$$
Thus the two algorithms $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning can be described as


$$Q_{n+1} = Q_n + \frac{1}{n}\, \Gamma_n\, H(Q_n, X_{n+1}) \qquad (7)$$
where $\Gamma_n = \Gamma_n^{Q_{\mathcal{H}}}$ or $\Gamma_n^{R_{\mathcal{H}}}$
is an adaptive gain matrix. For $\beta_n = 0$ and
$\alpha_n(x,a) = \frac{1}{n}$,
$\Gamma^{Q_{\mathcal{H}}} = \Gamma^{R_{\mathcal{H}}} = I$, which
corresponds to the simplest version of $Q_{\mathcal{H}}$-Learning.
Therefore, within the ODE method, $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning can be considered as two discrete
approximations with adaptive matrix-valued gains
$\Gamma^{Q_{\mathcal{H}}}$ and $\Gamma^{R_{\mathcal{H}}}$ of the same
differential equation:
$$\frac{dQ}{dt} = h(Q) \qquad (8)$$
with $h(Q) = \lim_{n \to \infty} E_Q[H(Q, X_n)]$. $h(Q)$ can be
calculated from the stationary distribution $\mu^Q$ of the Markov
chain $\{X_n\}_n$ given a constant parameter $Q$:
$h(Q) = \sum_X H(Q,X)\, \mu^Q(X)$.

To calculate the stationary distribution $\mu^Q$ we have to take
into account the fact that for reinforcement learning
algorithms, the input sequence $\{X_n\}_n$ is a Markov process
controlled by the parameter vector $Q_n$ itself
[Benveniste et al., 1990]. For Markovian exploration functions,
where $a_n$ depends probabilistically on $x_n$ and $Q_n$, $\mu^Q$
is given by:
$$\forall X = (x,a,x'),\quad \mu^Q(X) = \mu_i^Q(x)\, P_{exp}^Q(a \mid x)\, p_i(x' \mid x,a)$$
where $\mu_i^Q$ is the stationary distribution of the Markov chain
$\{x_n\}_n$ defined by
$P(x' \mid x) = \sum_{a \in A_i} p_i(x' \mid x,a)\, P_{exp}^Q(a \mid x)$
for $x \in S_i$, and $P_{exp}^Q(a_n \mid x_n)$ is the selection
probability of the exploration function. Since we added a
uniform return from $S_N$ to $S_1$, we have $\forall x \in S_1$,
$\mu_1^Q(x) = \frac{1}{N\,|S_1|}$. Iteratively, $\mu^Q$ is computed
on each state-space $S_i$ as: $\forall x \in S_{i+1}$,
$\mu_{i+1}^Q(x) = \sum_{z \in S_i} P(x \mid z)\, \mu_i^Q(z)$.

We easily check that $Q^*$ is a stable attractive point of (8).
First we can see that $h(Q^*) = 0$. Moreover, we can calculate
the Jacobian matrix of $h$ at that point:
$$h_Q(Q^*) = \frac{\partial h}{\partial Q}(Q^*), \qquad
\left[ h_Q(Q^*) \right]_{(x,a),(x',a')} = \begin{cases}
-\mu^*(x,a) & \text{if } (x,a) = (x',a') \\
p_i(x' \mid x,a)\, \mu^*(x,a) & \text{if } x \in S_i,\ i < N,\ x' \in S_{i+1},\ a' = \pi^*(x') \\
0 & \text{otherwise}
\end{cases} \qquad (9)$$


with $\pi^*(x) = \arg\max_a Q^*(x,a)$ and
$\mu^*(x,a) = \mu_i^{Q^*}(x)\, P_{exp}^{Q^*}(a \mid x)$. The eigenvalues
of $h_Q(Q^*)$ are equal to $-\mu^*(x,a)$. They are strictly
negative under the simple assumptions that the Markov chain
$\{x_n\}_n$ is recurrent at $Q = Q^*$, and that $\forall x, a$,
$P_{exp}(a \mid x) > 0$.

Based on that material, we can now focus on the problem of the
learning rate analysis, and its application to the comparison
between $Q_{\mathcal{H}}$-Learning and $R_{\mathcal{H}}$-Learning in
finite-horizon MDPs.

3.3  Optimal  matrix-valued  learning  rates 

The use of a matrix-valued gain to guide and accelerate the
convergence of a stochastic adaptive algorithm is a classic
result of stochastic approximation theory
[Benveniste et al., 1990, Kushner and Yin, 1997].

For the algorithm (7) the gain matrices $\Gamma$ that maintain
$Q^*$ as a stable equilibrium of the new ODE
$$\frac{dQ}{dt} = \Gamma\, h(Q)$$
are characterized by: for every eigenvalue $\lambda$ of
$\frac{I}{2} + \Gamma\, h_Q(Q^*)$, $\mathrm{Re}(\lambda) < 0$. Among
all these matrices, it is possible to prove
[Benveniste et al., 1990, Kushner and Yin, 1997] that the one
that minimizes the asymptotic variance
$\lim_{n \to \infty} \|Q_n - Q^*\|^2$ is defined by
$$\Gamma^* = -h_Q^{-1}(Q^*). \qquad (10)$$

As we can see, the knowledge of the target parameter $Q^*$ is
generally required, and an adaptive matrix-valued gain
$\Gamma_n$ that converges toward $\Gamma^*$ is often used. In our
case, $\Gamma^*$ can be calculated by inverting (9). We obtain the
following upper triangular optimal matrix
[Garcia and Ndiaye, 1998]:
$$\Gamma^*_{(x,a),(x',a')} = \begin{cases}
\dfrac{1}{\mu^*(x,a)} & \text{if } (x,a) = (x',a') \\
\dfrac{P^{\pi^*}(x' \mid x,a)}{\mu^*(x',a')} & \text{if } x \in S_i,\ x' \in S_j,\ i < j,\ a' = \pi^*(x') \\
0 & \text{otherwise}
\end{cases}$$
where $P^{\pi}(x' \mid x,a)$ is the probability of going from
$x \in S_i$ to $x' \in S_j$, $1 \leq i < j \leq N$, in $j - i$
steps, by first executing the action $a$, and then by following
the current policy $\pi^Q(x) = \arg\max_a Q(x,a)$ for the last
$j - i - 1$ steps.
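The relation between the Jacobian (9) and the optimal gain (10) can be checked numerically on a small model. The sketch below is an illustration under assumptions: it writes the Jacobian as $h_Q(Q^*) = -M (I - B)$, with $M$ the diagonal matrix of the weights $\mu^*(x,a)$ and $B$ the matrix of one-step transitions followed by the optimal action, and uses the fact that $B$ is nilpotent in the layered model, so that $(I - B)^{-1}$ is a finite sum. The function names, the toy transition model, the policy and the weights are hypothetical.

```python
def matmul(A, B):
    """Plain-Python product of two square matrices stored as lists of rows."""
    d = len(A)
    return [[sum(A[s][k] * B[k][t] for k in range(d)) for t in range(d)]
            for s in range(d)]


def jacobian_and_optimal_gain(p, pi_star, mu):
    """Return (pairs, h, gamma): the Jacobian (9) and the gain (10) (sketch).

    p[i][x][a][y] : transition probabilities from step i to step i+1 (0-based)
    pi_star[i][x] : optimal action in state x of step i
    mu[i][x][a]   : stationary weight mu*(x, a), assumed strictly positive
    """
    N = len(mu)
    pairs = [(i, x, a) for i in range(N)
             for x in range(len(mu[i])) for a in range(len(mu[i][x]))]
    index = {pair: k for k, pair in enumerate(pairs)}
    d = len(pairs)
    weight = [mu[i][x][a] for (i, x, a) in pairs]
    # B[(x,a),(x',a')] = p_i(x'|x,a) when a' is the optimal action in x'.
    B = [[0.0] * d for _ in range(d)]
    for (i, x, a), row in index.items():
        if i < N - 1:
            for y, prob in enumerate(p[i][x][a]):
                B[row][index[(i + 1, y, pi_star[i + 1][y])]] = prob
    identity = [[1.0 if s == t else 0.0 for t in range(d)] for s in range(d)]
    # Jacobian (9): h = -M (I - B), with M = diag(mu*).
    h = [[weight[s] * (B[s][t] - identity[s][t]) for t in range(d)]
         for s in range(d)]
    # (I - B)^{-1} = I + B + ... + B^{N-1}: B only links step i to step i+1.
    series = [row[:] for row in identity]
    power = [row[:] for row in identity]
    for _ in range(N - 1):
        power = matmul(power, B)
        series = [[series[s][t] + power[s][t] for t in range(d)] for s in range(d)]
    # Gain (10): Gamma* = -h^{-1} = (I - B)^{-1} M^{-1}.
    gamma = [[series[s][t] / weight[t] for t in range(d)] for s in range(d)]
    return pairs, h, gamma


# Toy two-step model (hypothetical numbers), uniform exploration over 2 actions.
p = [[[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 0.0]]]]
pi_star = [[1, 0], [0, 1]]
mu = [[[0.125, 0.125], [0.125, 0.125]],
      [[0.15625, 0.15625], [0.09375, 0.09375]]]
pairs, h, gamma = jacobian_and_optimal_gain(p, pi_star, mu)
check = matmul([[-v for v in row] for row in h], gamma)  # should be the identity
```

With the pairs ordered by decision step, `gamma` is upper triangular, its diagonal holds $1/\mu^*(x,a)$, and its only non-zero off-diagonal columns are those of optimal actions.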


3.4  Comparison between $\Gamma^{Q_{\mathcal{H}}}$, $\Gamma^{R_{\mathcal{H}}}$ and $\Gamma^*$


For the two algorithms $Q_{\mathcal{H}}$-Learning (2) and
$R_{\mathcal{H}}$-Learning (5) that we consider in this paper, the
gains $\Gamma^{Q_{\mathcal{H}}}$ and $\Gamma^{R_{\mathcal{H}}}$ are
adaptive gains that depend on $n$, but also on $X_n$ and $Q_n$.

In order to be able to compare these matrix-valued gains with
the optimal gain $\Gamma^*$, it is necessary to consider their
asymptotic behaviour. If we assume classically that
$\alpha_n(x,a) = \frac{\alpha_0}{N(x,a)}$ where $N(x,a)$ is the
total number of times the pair $(x,a)$ was visited at time $n$,
and $\beta_n = \frac{\beta_0}{n}$, we show in
[Garcia and Ndiaye, 1998] that $\Gamma^{Q_{\mathcal{H}}}$ and
$\Gamma^{R_{\mathcal{H}}}$ converge respectively toward:


$$\left[ \Gamma_{\infty}^{Q_{\mathcal{H}}} \right]_{(x,a),(x',a')} = \begin{cases}
\dfrac{\alpha_0}{\mu^*(x,a)} & \text{if } (x,a) = (x',a') \\
0 & \text{otherwise}
\end{cases}$$
$$\left[ \Gamma_{\infty}^{R_{\mathcal{H}}} \right]_{(x,a),(x',a')} = \begin{cases}
\dfrac{\alpha_0}{\mu^*(x,a)} + (N - i + 1)\,\beta_0 & \text{if } (x,a) = (x',a'),\ x \in S_i,\ a = \pi^*(x) \\
\dfrac{\alpha_0}{\mu^*(x,a)} & \text{if } (x,a) = (x',a'),\ a \neq \pi^*(x) \\
(N - i + 1)\,\beta_0 & \text{if } (x,a) \neq (x',a'),\ x \in S_i,\ a' = \pi^*(x') \\
0 & \text{if } (x,a) \neq (x',a'),\ a' \neq \pi^*(x')
\end{cases}$$

A first remark about these matrix-valued gains is that the
stability condition on $Q^*$ implies that $\alpha_0 > \frac{1}{2}$.
This explains some empirical results concerning R-Learning which
reveal that higher initial values of $\alpha_0$ are to be
preferred to lower values [Mahadevan, 1996b].

It is now interesting to compare $\Gamma^{Q_{\mathcal{H}}}$,
$\Gamma^{R_{\mathcal{H}}}$ and $\Gamma^*$. For $\alpha_0 = 1$ and a
small $\beta_0$, the three matrices have more or less the same
diagonal values, which is a confirmation of the good choice
$\alpha_n(x,a) = \frac{1}{N(x,a)}$, asymptotically equivalent to
$\frac{1}{n\,\mu^*(x,a)}$.

Another important similarity between $\Gamma^{R_{\mathcal{H}}}$ and
$\Gamma^*$ is about the structure of the matrices: both of them
have exactly the same null columns.

4  Simulations

In order to experimentally compare $Q_{\mathcal{H}}$-Learning,
$R_{\mathcal{H}}$-Learning and the $\Gamma^*$Q-Learning
corresponding to (7) with $\Gamma = \Gamma^*$, we have developed a
random finite-MDP generator. At each step $i$, a set $S_i$ of
$n_S$ states and a set $A_i$ of $n_A$ actions are defined. Each
transition from $S_i$ to $S_{i+1}$ is characterized by a set of
$n_A$ transition matrices $p_i(\cdot \mid \cdot, a)$ and $n_A$
reward vectors $r_i(\cdot, a)$. The problem parameters are $N$,
$n_S$ and $n_A$. The reward values $r_i(s,a)$ and the
probabilities $p_i(s' \mid s,a)$ are drawn in $[0,1]$ from a
random number generator, with the constraints
$\sum_{s'} p_i(s' \mid s,a) = 1$.

For a given random MDP, we first calculate the exact
finite-horizon optimal policy $\pi^*$ with the classical $N$-step
backward dynamic programming algorithm, using the $p_i$ and
$r_i$ values. Then we calculate $\rho^*$, $\mu^*$, and finally the
$\Gamma^*$ optimal gain.

We evaluated the performance of the 3 algorithms
$Q_{\mathcal{H}}$-Learning, $R_{\mathcal{H}}$-Learning and
$\Gamma^*$Q-Learning on different random MDPs
[Garcia and Ndiaye, 1998].

The learning parameters $\alpha_n$ and $\beta_n$ were defined by
$$\alpha_n(x,a) = \frac{\alpha_0}{N(x,a)} \quad \text{and} \quad \beta_n = \frac{\beta_0}{n/N},$$


where $N(x,a)$ is the number of times the action $a$ has been
chosen in the state $x$. We used $\alpha_0 = 1$ and
$\beta_0 = 0.4$ for all simulations, with a semi-uniform
exploration function (r = 90%). These choices of learning rates
were made to optimize the behaviour of $Q_{\mathcal{H}}$-Learning
and $R_{\mathcal{H}}$-Learning on the set of problems we
considered.

We chose
$\rho_{\pi_n} = \frac{1}{N\,|S_1|} \sum_{x \in S_1} V_1^{\pi_n}(x)$ as a
performance measure of the current policy $\pi_n$ at iteration
$n$, and the variability of this policy for different learning
trajectories was taken into account by calculating the mean of
the value $\rho_{\pi_n}$ over $M$ different runs (we took
$M = 5$). More precisely we considered for each algorithm the two
criteria $C_1$: $\rho_{\pi_n} / \rho^*$ for $n = 500N$, and $C_2$:
$\rho_{\pi_n} / \rho^*$ for $n = 5000N$, where
$\rho^* = \frac{1}{N\,|S_1|} \sum_{x \in S_1} V_1^*(x)$ is the optimal
average gain of $\pi^*$.
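The experimental ingredients described in this section, a random finite-horizon MDP and the performance measure $\rho_{\pi}$ of a fixed policy, can be sketched as follows. This is a hedged illustration rather than the authors' generator: the constraint on the probabilities is imposed here by normalizing uniform draws, which is an assumption, and the names `random_finite_mdp` and `average_gain` are hypothetical.

```python
import random


def random_finite_mdp(N, n_s, n_a, seed=0):
    """Draw rewards in [0, 1) and transition rows summing to 1 (sketch).

    Returns (p, r) with r[i][x][a] for i < N and p[i][x][a][y] for i < N-1.
    """
    rng = random.Random(seed)
    r = [[[rng.random() for _ in range(n_a)] for _ in range(n_s)]
         for _ in range(N)]
    p = []
    for _ in range(N - 1):
        layer = []
        for _ in range(n_s):
            rows = []
            for _ in range(n_a):
                w = [rng.random() for _ in range(n_s)]
                total = sum(w)
                rows.append([v / total for v in w])  # normalization (assumption)
            layer.append(rows)
        p.append(layer)
    return p, r


def average_gain(p, r, policy):
    """Return (1 / (N |S_1|)) * sum over S_1 of V_1^pi(x) for a fixed policy."""
    N = len(r)
    V = []                                   # values of the following step
    for i in reversed(range(N)):
        new_V = []
        for x in range(len(r[i])):
            a = policy[i][x]
            v = r[i][x][a]
            if i + 1 < N:                    # nothing is collected after step N
                v += sum(p[i][x][a][y] * V[y] for y in range(len(V)))
            new_V.append(v)
        V = new_V
    return sum(V) / (N * len(V))


p, r = random_finite_mdp(N=5, n_s=4, n_a=3, seed=1)
always_first = [[0] * 4 for _ in range(5)]   # arbitrary policy: action 0 everywhere
rho = average_gain(p, r, always_first)
```

Dividing `average_gain` of a learned policy by the value obtained for the optimal policy gives a ratio of the kind used in the criteria $C_1$ and $C_2$.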

The first surprising fact we noticed was that most of the time
the $\Gamma^*$Q-Learning did not converge. An explanation we
found is that for large problems, the initial gains of
$\Gamma^*$Q-Learning are too large, and make the series
$\{Q_n\}_n$ leave its convergence domain. To fix that problem we
decided to replace $\Gamma^*$ by an adaptive matrix $\Gamma_n^*$
asymptotically equivalent to $\Gamma^*$, where $\frac{n}{N(x,a)}$
is used instead of $\frac{1}{\mu^*(x,a)}$. The results we finally
obtained showed that $\Gamma_n^*$Q-Learning is a bit better than
$R_{\mathcal{H}}$-Learning, and that both of them are always
faster than $Q_{\mathcal{H}}$-Learning, as illustrated in table 3
and figure 4.


Figure 3: Relative evaluation of $Q_{\mathcal{H}}$-Learning, $R_{\mathcal{H}}$-Learning and $\Gamma^*$Q-Learning (column label: $n = 500N$).

Figure 4: $n_A = 10$, $n_S = 300$, $N = 5$ (5 runs).


5  Conclusion 

The underlying goal of this article was to tackle the problem of
using reinforcement learning algorithms in the framework of
finite-horizon Markov Decision Processes. Two main results have
been obtained.

First we proved the equivalence between the total reward
criterion and the average-reward criterion in finite horizon. An
interesting conclusion is that classical Q-Learning and
R-Learning algorithms can be adapted to define
$Q_{\mathcal{H}}$-Learning and $R_{\mathcal{H}}$-Learning algorithms
in finite horizon. Both of these algorithms converge
experimentally toward the optimal $V^*$ value functions, with a
convergence proof for $Q_{\mathcal{H}}$-Learning.

The other important result is about the comparison between the
learning rates of $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning. It appears that
$R_{\mathcal{H}}$-Learning can be seen as a version of
$Q_{\mathcal{H}}$-Learning using matrix-valued stepsizes, where
several components of the $Q$ function are updated
simultaneously. Furthermore, we showed that this stepsize matrix
is structurally and numerically very close to the optimal gain
matrix proposed by the ODE method, and that
$R_{\mathcal{H}}$-Learning performs very similarly to the learning
algorithm corresponding to this optimal gain matrix.
Consequently we argue in favor of using
$R_{\mathcal{H}}$-Learning when solving finite-horizon MDPs.

Several open questions still deserve to be considered. First we
would like to know whether it is possible to derive from
$\Gamma^*$Q-Learning an equivalent reinforcement learning
algorithm where only one component is updated at each state
transition, as is the case for $R_{\mathcal{H}}$-Learning in its
initial formulation. For the moment, independently of the fact
that it requires knowledge of $Q^*$ and $\mu^*$,
$\Gamma^*$Q-Learning cannot be used in practice since it is far
too slow.

Another question we are currently considering is how to exploit
the equivalent Q formulation of $R_{\mathcal{H}}$-Learning for
proving its convergence. Some recent theoretical results
concerning the ODE method could be sufficient, like in
[Benaïm et al., 1998].

Finally, we are trying to generalize our results concerning the
convergence of $Q_{\mathcal{H}}$-Learning and
$R_{\mathcal{H}}$-Learning to Q-Learning and R-Learning within the
classical framework of stationary infinite MDPs.
Abstract

How can we guarantee that our software and robotic agents will
behave as we require, even after learning? Formal verification
should play a key role but can be computationally expensive,
particularly if re-verification follows each instance of
learning. This is especially a problem if the agents need to
make rapid decisions and learn quickly while online. Therefore,
this paper presents novel methods for reducing the time
complexity of re-verification subsequent to learning. The goal
is agents that are predictable and can respond quickly to new
situations.

1  INTRODUCTION 

Software  and  robotic  agents  are  becoming  increasingly 
prevalent.  Agent  designers  can  furnish  such  agents 
with  plans  to  perform  desired  tasks.  Nevertheless, 
a  designer  cannot  possibly  foresee  all  circumstances 
that  will  be  encountered  by  the  agent.  Therefore,  in 
addition  to  supplying  an  agent  with  plans,  it  is  es¬ 
sential  to  also  enable  the  agent  to  learn  and  mod¬ 
ify  its  plans  to  adapt  to  unforeseen  circumstances. 
The  introduction  of  learning,  on  the  other  hand,  of¬ 
ten  makes  the  agent’s  behavior  significantly  harder  to 
predict.  Our  objective  is  to  develop  methods  that  pro¬ 
vide  verifiable  guarantees  that  the  behavior  of  learning 
agents  always  remains  within  the  bounds  of  specified 
constraints  (called  “properties”),  even  after  learning. 
An  example  of  a  property  is  Asimov’s  First  Law  of 
Robotics  (Asimov,  1942).  This  law,  which  has  recently 
been  studied  by  Weld  and  Etzioni  (1994),  states  that 
a  robot  may  not  harm  a  human  or  allow  a  human  to 
come  to  harm.  Weld  and  Etzioni  advocate  a  “  ‘call 


to  arms;’  before  we  release  autonomous  agents  into 
real-world  environments,  we  need  some  credible  and 
computationally  tractable  means  of  making  them  obey 
Asimov’s  First  Law. ..how  do  we  stop  our  artifacts  from 
causing  us  harm  in  the  process  of  obeying  our  orders?” 
Asimov’s  law  can  be  operationalized  into  specific  prop¬ 
erties  testable  on  a  system,  e.g.,  “Never  delete  another 
user’s  file.”  This  paper  addresses  Weld  and  Etzioni’s 
“call  to  arms”  in  the  context  of  adaptive  agents.  It  is 
a  very  important  topic  for  real-world  agents  and  is  a 
dominant  theme  in  science  fiction,  which  is  sometimes 
prescient. Examples include the Borgs (Star Trek: The
Next Generation), Bolos (Laumer, 1976), and Berserkers
(Saberhagen, 1967) - fictional agents that demonstrate
the dangerous behavior that can result from
insufficient constraints.

We  assume  that  an  agent’s  plan  has  been  initially  veri¬ 
fied  offline.  Then,  the  agent  is  fielded  and  has  to  adapt 
online.  After  adaptation  via  learning,  the  agent  must 
rapidly  re-verify  its  new  plan  to  ensure  this  plan  still 
satisfies required properties.¹ Re-verification must be
as  computationally  efficient  as  possible  because  it  is 
performed  online,  perhaps  in  a  highly  time-critical  sit¬ 
uation.  There  are  numerous  applications  of  this  sce¬ 
nario,  including  software  agents  that  can  safely  ac¬ 
cess  information  in  confidential  or  proprietary  environ¬ 
ments  while  responding  to  rapidly  changing  access  re¬ 
quirements,  planetary  rovers  that  quickly  adapt  to  un¬ 
foreseen  planetary  conditions  but  behave  within  criti¬ 
cal  mission  constraints,  and  JAVA  applets  that  can  get 
smarter  but  not  become  destructive  to  our  computing 
environments. 

Typically,  properties  desired  by  a  user  are  orthogo¬ 
nal  to  both  the  agent’s  planning  goals  and  its  learning 

¹Current output is success/failure. Future work will
consider  using  re-verification  counterexamples  to  choose  a 
better  learning  method  when  re-verification  fails. 
goals. For example, the agent may generate a plan
with  the  objective  of  maximizing  the  agent’s  profit. 
Learning  might  have  the  goal  of  achieving  the  agent’s 
plan  more  efficiently  or  modifying  the  plan  to  adapt 
to  unforeseen  events.  The  designer  may  have  an  addi¬ 
tional  constraint  that  the  agent  does  not  cheat  in  its 
dealings  with  other  agents.  Why  doesn’t  the  planner 
incorporate  all  properties  into  the  plan?  There  are  a 
number  of  possible  reasons,  e.g.,  not  all  properties  may 
be  known  at  the  time  the  plan  is  developed,  or  security 
reasons. 

Re-verification  can  be  (from  least  to  most  time  re¬ 
quired):  none,  incremental,  or  complete.  It  is  pos¬ 
sible  to  avoid  re-verification  entirely  if  we  restrict  the 
agent  to  using  only  those  learning  methods  determined 
a  priori  to  be  “safe”  with  respect  to  certain  classes  of 
properties  in  which  we  are  interested.  In  other  words, 
if  a  plan  satisfies  a  property  prior  to  learning,  we  want 
an  a  priori  guarantee  that  the  property  will  still  be 
satisfied  subsequent  to  learning.  Note  that  this  incurs 
no  run-time  cost.  It  is  called  “moving  a  tester  into  the 
generator”  or  “compiling  constraints.” 

Unfortunately,  the  safety  of  some  learning  methods 
may  be  very  difficult  or  maybe  impossible  to  deter¬ 
mine  a  priori.  When  a  priori  determination  is  too  dif¬ 
ficult,  it  is  helpful  to  use  incremental  re-verification. 
Incremental  methods  save  computational  costs  over 
re-verification from scratch by localizing re-verification
and/or  by  reusing  knowledge  from  the  original  verifica¬ 
tion.  Furthermore,  incremental  methods  may  identify 
positive  results  that  cannot  be  determined  a  priori. 
When  an  agent  needs  to  learn,  we  suggest  that  the 
agent  should  consult  the  a  priori  results  first.  If  no 
positive results exist, then incremental re-verification
proceeds. The least desirable of the three alternatives
is to do complete re-verification from scratch.
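
The preference order just described (a priori results first, then incremental re-verification, then complete re-verification) can be sketched as a small dispatch routine. This is only an illustrative sketch: the function name `choose_reverification`, the lookup table of a priori results, and the convention of passing `None` when no incremental method exists are assumptions made here, not part of the system described in this paper.

```python
def choose_reverification(operator, prop_class, a_priori_safe,
                          incremental=None, complete=None):
    """Return (method_used, verdict) for one learning step.

    a_priori_safe: set of (operator, property class) pairs already proven safe.
    incremental / complete: zero-argument callables returning True or False.
    """
    # 1. No re-verification: the operator is known a priori to be safe.
    if (operator, prop_class) in a_priori_safe:
        return ("none", True)
    # 2. Incremental re-verification, when a method exists for this case.
    if incremental is not None:
        return ("incremental", incremental())
    # 3. Last resort: complete re-verification from scratch.
    return ("complete", complete())


# Hypothetical table entry, for illustration only.
SAFE = {("deletion", "Invariance")}
print(choose_reverification("deletion", "Invariance", SAFE))
print(choose_reverification("generalization", "Invariance", SAFE,
                            incremental=lambda: True))
```

The table lookup costs nothing at run time beyond the lookup itself, which mirrors the point that a priori results incur no re-verification cost.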

Gordon  (1997a)  begins  to  explore  the  extent  to  which 
we  can  prove  a  priori  results  that  certain  machine 
learning  operators  are,  or  are  not,  safe  for  certain 
classes  of  properties.  The  paper  has  positive  a  priori 
results  for  plan  efficiency  improvements  via  deletion 
of  plan  elements,  as  well  as  for  plan  refinement  meth¬ 
ods.  Unfortunately,  we  have  not  yet  obtained  positive 
a  priori  results  for  popular  machine  learning  operators 
such  as  abstraction  (unless  one  is  willing  to  accept  an 
abstracted  property)  or  generalization.  Abstraction 
is  a  more  global  operator  than  generalization.  Ab¬ 
straction  alters  the  language  of  a  plan  (e.g.,  by  feature 
selection),  whereas  generalization  alters  the  condition 
for  a  state-to-state  transition  within  a  plan.  Both  are 
extremely  common  operators  in  concept  learning,  but 


are  also  very  appropriate  for  plan  modification. 

This  paper  has  two  contributions  beyond  (Gordon, 
1997a).  First,  the  previous  paper  models  agent  plans 
using  automata  on  infinite  strings.  This  paper  reaches 
a  wider  audience  by  using  the  more  familiar  automata 
on finite strings. Second, this paper addresses two
new questions: Are there situations in which an
abstracted property is acceptable? If yes, we have
positive a priori results for abstraction. Also, can we
get positive results by using incremental re-verification
rather than a priori? Initial, positive answers to these
questions  are  presented  here. 

The  remainder  of  this  paper  is  organized  as  follows. 
Section  2  presents  an  illustrative  example  that  is  used 
throughout the paper.² Section 3 contains background
material and definitions on automaton plans,
temporal  logic  properties,  and  “safe”  learning.  The 
formal  definitions  provide  a  precise  foundation  for  un¬ 
derstanding  the  incremental  re-verification  methods 
presented  later.  Section  4  lists  situations  in  which 
property abstraction is acceptable. Sections 4 and 5
present novel (and as far as we are aware, the only)
methods for incremental re-verification of abstraction
and generalization, respectively, on automata. Finally,
time complexity comparisons between incremental and
complete re-verification are provided.

2  ILLUSTRATIVE  EXAMPLE 

This  section  provides  an  example  to  illustrate  some 
of  the  main  ideas  of  the  paper.  Although  the  plan 
in  this  example  is  very  small,  it  is  important  to  point 
out  that  existing  automata-based  verification  methods 
currently  handle  huge,  industrial-sized  problems  (e.g., 
see  Kurshan,  1994).  Our  goal  is  to  improve  the  time 
complexity  of  verification  over  current  methods  when 
learning  occurs. 

In  our  example,  hundreds  of  tiny,  micro  air  vehicles 
(MAVs)  are  required  to  perform  a  task  within  a  region. 
The  MAVs  are  divided  into  two  groups  called  “swarm 
A”  and  “swarm  B.”  One  constraint,  or  property,  is  that 
only  one  MAV  may  enter  the  region  at  a  time  -  because 
multiple  MAVs  entering  simultaneously  would  increase 
the  risk  of  detection.  Each  swarm  has  a  separate  FIFO 
queue  of  MAVs.  MAVs  enter  the  queue  when  they 
return  from  their  last  task.  A  second  constraint  is  that 
some  (at  least  one)  MAVs  from  each  swarm  eventually 

²Examples in this paper have been implemented using
Kurshan’s COSPAN verification system. COSPAN is
an  AT&T  verification  tool,  which  is  described  in  Kurshan 
(1994). 
enter the region. One distinguished MAV, C, acts as a
task coordinator. C selects which swarm, A or B, may
send in an MAV next.³

Plans  for  swarm  A  and  task  controller  C  are  shown  in 
Figures  1  and  2.  The  plan  for  swarm  B  is  not  shown 
in  the  figure,  but  it  is  identical  to  the  plan  for  A  ex¬ 
cept  all  instances  of  “A”  are  replaced  by  “B.”  Each 
of  these  plans  is  a  finite-state  automaton,  i.e.,  a  graph 
with  states  (the  vertices)  and  allowable  state-to-state 
transitions  (the  directed  edges  between  vertices).  The 
transition  conditions  (i.e.,  the  logical  expressions  label¬ 
ing  the  edges)  describe  the  set  of  actions  that  enable  a 
state  transition  to  occur.  The  possible  actions  A  can 
take from a state are (A:no-MAVs), (A:MAVs-wait),
or  (A:MAVs-go).  The  first  action  means  the  queue  is 
empty,  the  second  that  the  queue  is  not  empty  but 
the  MAVs  in  the  queue  must  wait,  and  the  third  that 
the  first  MAV  in  the  queue  enters  the  region.  Likewise 
for  B.  The  possible  actions  C  can  take  from  a  state 
are  (C:go-A)  or  (C:go-B).  The  first  action  means  con¬ 
troller  C  allows  swarm  A  to  send  one  MAV  into  the 
region,  the  second  means  C  allows  B  to  send  one  MAV 
into  the  region. 

Swarms  A  and  B  are  single  agents,  i.e.,  although  indi¬ 
vidual  MAVs  may  each  have  their  own  plan,  such  as 
queuing  within  a  swarm,  for  simplicity  we  ignore  that 
level  of  detail.  We  can  form  a  multiagent  plan  by  tak¬ 
ing  a  “product”  (see  Section  3.1)  of  the  plans  for  A,  B, 
and  C.  This  product  synchronizes  the  behavior  of  A, 
B,  and  C  in  a  coordinated  fashion.  At  every  discrete 
time  step,  every  agent  (A,  B,  C)  is  at  one  state  in  its 
plan,  and  it  selects  its  next  action.  The  action  of  one 
agent (e.g., A) becomes an input to the other agents’
plans  (e.g.,  B  and  C).  If  the  joint  actions  chosen  by  all 
three  agents  satisfy  the  transition  conditions  of  a  plan 
from  the  current  state  to  some  next  state,  then  that 

³This example is a variant of the traffic controller in
Kurshan  (1994). 


transition  may  be  made.  For  example,  if  the  agents 
jointly  take  the  actions  (A:MAVs-wait)  and  (B:MAVs- 
wait)  and  (C:go-A),  then  the  multiagent  plan  can  tran¬ 
sition  from  the  global,  joint  state  (WAIT,  WAIT,  GO- 
A)  to  the  joint  state  (GO,  WAIT,  GO-B)  represented 
by  triples  of  states  in  the  automata  for  agents  A,  B, 
and  C. 

Given  the  full,  multiagent  plan,  verification  now  con¬ 
sists  of  asking  the  question:  Does  this  plan  satisfy  the 
two  required  properties,  i.e.,  some  MAVs  from  each 
swarm  enter  the  region,  but  only  one  MAV  enters  the 
region  at  a  time?  Assuming  our  initial  plan  in  Figures 
1  and  2  satisfies  these  properties,  we  next  ask  whether 
the  properties  are  still  satisfied  subsequent  to  learning. 
The  latter  question  is  the  topic  of  this  paper. 

An  example  of  learning  is  the  following.  Suppose  co¬ 
ordinator  C  discovers  that  the  B  swarm  has  left  the 
region.  One  way  agent  C  can  adapt  to  incorporate 
this  new  knowledge  is  by  deleting  the  action  (C:go-B) 
from  its  action  repertoire.  This  is  a  form  of  abstrac¬ 
tion.  There  are  alternative  modifications  agent  C  can 
do,  but  the  selection  between  these  alternatives  is  a 
learning  issue,  which  we  do  not  address  here.  What 
we do address here are the implications of this choice,
in  particular,  which  learning  methods  are  safe,  i.e., 
preserve  the  properties. 

3  PLANS,  PROPERTIES,  AND 
“SAFE”  LEARNING 

3.1  AUTOMATON  PLANS 

This subsection, which is based on Kurshan (1994),
briefly summarizes the basics of the automata used
to model plans. Figures 1 and 2 illustrate the definitions.
Essentially, an automaton is a graph with vertices
corresponding to states and directed edges corresponding
to state-to-state transitions. The terms “vertex”
and “state” are used interchangeably throughout
the  paper.  For  an  automaton  representing  an  agent’s 
plan,  vertices  represent  the  internal  state  of  the  agent 
and/or  the  state  of  its  external  environment.  State-to- 
state  transitions  have  associated  transition  conditions, 
which  are  the  conditions  under  which  the  transition 
may  be  made.  An  agent  action  that  satisfies  a  transi¬ 
tion  condition  enables  that  transition  to  be  made.  We 
assume  finite-state  automata,  i.e.,  the  set  of  states  is 
finite,  and  that  the  transition  conditions  are  elements 
of  a  Boolean  algebra.  Therefore,  we  briefly  diverge  to 
summarize  the  basics  of  Boolean  algebras. 

A Boolean algebra $\mathcal{K}$ is a set with distinguished
elements 0 and 1, closed under the Boolean operations $*$
(logical “and”), $+$ (logical “or”), and $\neg$ (logical
negation), and satisfying the standard properties (Kurshan,
1994).

The Boolean algebras are assumed to be finite. There
is a partial order among the elements, $\leq$, which is
defined as $x \leq y$ if and only if $x * y = x$. The elements
0 and 1 are defined as $\forall x \in \mathcal{K}$, $0 \leq x$ and
$\forall x \in \mathcal{K}$, $x \leq 1$. The atoms of $\mathcal{K}$, $\Gamma(\mathcal{K})$, are the
nonzero elements of $\mathcal{K}$ minimal with respect to $\leq$.
For two different atoms $x$ and $y$ within the same Boolean
algebra, $x * y = 0$. For Figures 1 and 2, agents A, B, and C each
have  their  own  Boolean  algebra  with  its  atoms.  The 
atoms  of  A’s  Boolean  algebra  are  the  actions  (A:no- 
MAVs),  (A:MAVs-wait),  and  (A:MAVs-go);  the  atoms 
of  B’s  algebra  are  (B:no-MAVs),  (B:MAVs-wait),  and 
(B:MAVs-go);  the  atoms  of  C’s  algebra  are  (C:go-A) 
and  (C:go-B). 
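
Because a finite Boolean algebra is isomorphic to the set of subsets of its atoms, the definitions above can be illustrated with sets. The following is a minimal sketch under that representation; the helper names `meet`, `join`, `neg`, and `leq` are chosen here for illustration and do not come from the paper or from COSPAN.

```python
# Atoms of agent A's algebra, as named in the text.
ATOMS_A = frozenset({"A:no-MAVs", "A:MAVs-wait", "A:MAVs-go"})

ZERO = frozenset()   # distinguished element 0
ONE = ATOMS_A        # distinguished element 1


def meet(x, y):      # the "*" operation (logical and)
    return x & y


def join(x, y):      # the "+" operation (logical or)
    return x | y


def neg(x):          # logical negation, relative to A's atoms
    return ATOMS_A - x


def leq(x, y):       # partial order: x <= y if and only if x * y = x
    return meet(x, y) == x


wait = frozenset({"A:MAVs-wait"})
go = frozenset({"A:MAVs-go"})
print(leq(wait, join(wait, go)))   # an atom lies below any sum containing it
print(meet(wait, go) == ZERO)      # two different atoms have product 0
```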

A Boolean algebra $\mathcal{K}'$ is a subalgebra of $\mathcal{K}$ if $\mathcal{K}'$ is a
non-empty subset of $\mathcal{K}$ that is closed under the operations
$*$, $+$, and $\neg$, and also has the distinguished
elements 0, 1. Let $\mathcal{K} = \prod \mathcal{K}_i$, i.e., $\mathcal{K}$ is the product
algebra of the $\mathcal{K}_i$. In this case the $\mathcal{K}_i$ are subalgebras
of $\mathcal{K}$. An atom of the product algebra is the product of
the atoms of the subalgebras. For example, if $a_1, \ldots, a_n$
are atoms of subalgebras $\mathcal{K}_1, \ldots, \mathcal{K}_n$, respectively, then
$a_1 * \ldots * a_n$ is an atom of $\mathcal{K}$.

In Figure 1, the Boolean algebra $\mathcal{A}$ used by agent A
is the smallest one containing the atoms of A’s algebra.
It contains all Boolean elements formed from A’s
atoms using the Boolean operators $*$, $+$, and $\neg$, and
including 0 and 1. These same definitions hold for B and C’s
algebras $\mathcal{B}$ and $\mathcal{C}$. One atom of the product algebra
$\mathcal{ABC}$ is (A:no-MAVs) * (B:no-MAVs) * (C:go-A). This
is the form of actions taken by the three agents in the
multiagent plan. Algebras $\mathcal{A}$, $\mathcal{B}$, and $\mathcal{C}$ are subalgebras
of the product algebra $\mathcal{ABC}$. Finally, $\mathcal{ABC}$ is the
Boolean algebra for the transition conditions in the
multiagent plan.
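
As a small illustration of the product algebra, its atoms can be enumerated as one action per agent. The sketch below assumes the set-of-atoms representation of a finite Boolean algebra; the helper `lift`, which embeds a single agent's action into the product algebra, is a name introduced here for illustration.

```python
from itertools import product

A = ("A:no-MAVs", "A:MAVs-wait", "A:MAVs-go")
B = ("B:no-MAVs", "B:MAVs-wait", "B:MAVs-go")
C = ("C:go-A", "C:go-B")

# Atoms of the product algebra ABC: one action per agent.
ATOMS_ABC = frozenset(product(A, B, C))


def lift(action):
    """Embed one agent's atom into ABC as the set of joint atoms containing it."""
    return frozenset(t for t in ATOMS_ABC if action in t)


# The product (A:no-MAVs) * (B:no-MAVs) * (C:go-A) is a single joint atom.
joint = lift("A:no-MAVs") & lift("B:no-MAVs") & lift("C:go-A")
print(len(ATOMS_ABC))
print(joint)
```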

Let  us  return  now  to  automata.  This  paper  focuses  on 
automata  that  model  agents  with  finite  lifetimes  (rep¬ 
resented  as  a  finite  string,  or  sequence  of  actions).  An 
example  is  an  agent  that  is  created  specially  to  exe¬ 
cute  a  plan  and  is  destroyed  immediately  afterwards. 
In  particular,  we  focus  on  processes.  Processes  are 
automata,  but  they  are  the  dual  of  our  usual  notion  of 
an  automaton,  which  accepts  any  string  beginning  in 
an  initial  state  and  ending  in  a  final  state  (Hopcroft  & 
Ullman,  1979).  Instead,  processes  accept  any  string 
beginning in an initial state and ending in a non-final
state.⁴ A string is a sequence of actions (atoms).
Therefore,  by  specifying  the  set  of  final  states,  we  can 
infer  the  set  of  action  sequences  not  permitted  by  the 
plan.  It  consists  of  those  strings  ending  in  a  final  state. 
All  other  action  sequences  that  begin  in  an  initial  state 
are  permitted  by  the  plan.  Processes  are  used  here  to 
be  consistent  with  the  automata  theoretic  verification 
literature. 

Formally, a process is a three-tuple $S =
(M_{\mathcal{K}}(S), I(S), F(S))$ where $\mathcal{K}$ is the Boolean algebra
corresponding to $S$. $M_{\mathcal{K}}(S) : V(S) \times V(S) \to \mathcal{K}$ is the
matrix of transition conditions, which are elements of
$\mathcal{K}$, $V(S)$ is the set of vertices of $S$, $I(S) \subseteq V(S)$ are
the initial states, and $F(S) \subseteq V(S)$ are the final states.
Also, $E(S) = \{e \in V(S) \times V(S) \mid M_{\mathcal{K}}(e) \neq 0\}$ is the
set of directed edges connecting pairs of vertices of $S$,
and $M_{\mathcal{K}}(e)$ is the transition condition of $M_{\mathcal{K}}(S)$
corresponding to edge $e$. Note that we omit edges labeled
“0.” By our definition, an edge whose transition
condition is 0 does not exist. We can alternatively denote
$M_{\mathcal{K}}(e)$ as $M_{\mathcal{K}}(v_i, v_{i+1})$ for the transition condition
corresponding to the edge going from vertex $v_i$ to vertex
$v_{i+1}$. For example, in Figure 1, $M_{\mathcal{K}}(\mathrm{WAIT}, \mathrm{GO})$ is
(A:MAVs-wait) * (C:go-A).

Figures  1  and  2  illustrate  the  process  definitions. 
There  are  process  plans  for  two  agents:  swarm  A  and 
task  coordinator  C.  Recall  that  agent  B  is  identical 
to  A  but  with  “A”  replaced  by  “B.”  An  incoming  ar¬ 
row  to  a  state,  not  from  any  other  state,  signifies  that 
this  is  an  initial  state.  Recall  that  the  output  actions 
of  process  A  are  its  atoms,  and  likewise  for  processes 
B  and  C.  The  transition  conditions  are  the  labels  on 
the edges. We assume for process $X$ = A, B, or C,
$F(X) = \emptyset$, i.e., there are no final states. Therefore
every  finite  string  of  actions  that  starts  in  an  initial 

⁴For the case of deterministic and complete transition
conditions,  reversing  the  acceptance  condition  will  comple¬ 
ment  the  language. 
state and satisfies the transition conditions is acceptable
behavior for the plan.

A  multiagent  plan  is  formed  from  single  agent  plans  by 
taking  the  tensor  product  of  the  processes  correspond¬ 
ing  to  the  individual  plans.  Essentially,  this  is  done 
by  taking  the  Cartesian  product  of  the  vertices  and 
the  intersection  of  the  transition  conditions.  For  de¬ 
tails  see  Kurshan  (1994).  The  product  process  models 
a  set  of  synchronous  processes.  The  Boolean  algebra 
corresponding  to  the  product  process  is  the  product 
algebra. For Figures 1 and 2, to formulate the process
$S$ modeling the entire multiagent plan, we take the tensor
product $S = A \otimes B \otimes C$ of the three processes. For
this tensor product, $I(S)$ = { (WAIT, WAIT, GO-A),
(WAIT, WAIT, GO-B) }, and $F(S) = \emptyset$. The tensor
product process is not shown in a figure because it is
quite large.
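
The construction (Cartesian product of vertices, intersection of transition conditions) can be sketched for two processes as follows. This is a simplified illustration, not Kurshan's definition: only the WAIT-to-GO condition of A is taken from the text, all other transitions are hypothetical stand-ins for Figures 1 and 2, swarm B is omitted to keep the example small, and the rule used for the final states of the product (a joint state is final if either component is final) is an assumption made here.

```python
from itertools import product

A_ACTS = ("A:no-MAVs", "A:MAVs-wait", "A:MAVs-go")
C_ACTS = ("C:go-A", "C:go-B")
ATOMS = frozenset(product(A_ACTS, C_ACTS))   # atoms of the product algebra AC


def cond(a=None, c=None):
    """Set of joint atoms matching the given A and/or C action."""
    return frozenset((x, y) for (x, y) in ATOMS
                     if a in (None, x) and c in (None, y))


# Only the WAIT -> GO condition comes from the text; the rest is hypothetical.
proc_a = {"M": {("WAIT", "GO"): cond(a="A:MAVs-wait", c="C:go-A"),
                ("WAIT", "WAIT"): cond(a="A:MAVs-wait", c="C:go-B"),
                ("GO", "WAIT"): cond(a="A:MAVs-go")},
          "initial": {"WAIT"}, "final": set()}
proc_c = {"M": {("GO-A", "GO-A"): cond(c="C:go-A"),
                ("GO-A", "GO-B"): cond(c="C:go-A"),
                ("GO-B", "GO-A"): cond(c="C:go-B"),
                ("GO-B", "GO-B"): cond(c="C:go-B")},
          "initial": {"GO-A", "GO-B"}, "final": set()}


def vertices(p):
    return {v for edge in p["M"] for v in edge} | p["initial"] | p["final"]


def tensor(p, q):
    """Product process: pairs of vertices, meet of transition conditions."""
    m = {}
    for ((v1, w1), c1), ((v2, w2), c2) in product(p["M"].items(),
                                                  q["M"].items()):
        joint = c1 & c2
        if joint:                        # edges labeled 0 are omitted
            m[((v1, v2), (w1, w2))] = joint
    final = {(v1, v2) for (v1, v2) in product(vertices(p), vertices(q))
             if v1 in p["final"] or v2 in q["final"]}   # assumed rule
    return {"M": m,
            "initial": set(product(p["initial"], q["initial"])),
            "final": final}


s = tensor(proc_a, proc_c)
print(sorted(s["initial"]))
print(s["M"][(("WAIT", "GO-A"), ("GO", "GO-B"))])
```

Representing conditions as sets of joint atoms makes the intersection of transition conditions a plain set intersection, at the cost of enumerating atoms explicitly.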

Formally, a string $x$ is a finite-dimensional vector,
$(x_0, \ldots, x_n) \in \Gamma(\mathcal{K})^+$, i.e., a string is a sequence of
one or more actions. A run $v$ of string $x$ is a sequence
$(v_0, \ldots, v_{n+1})$ of vertices such that $\forall i$, $0 \leq i \leq n$,
$x_i * M_{\mathcal{K}}(v_i, v_{i+1}) \neq 0$, i.e., $x_i \leq M_{\mathcal{K}}(v_i, v_{i+1})$ because
the $x_i$ are atoms.

The language of $S$ is $\mathcal{L}(S) = \{x \in \Gamma(\mathcal{K})^+ \mid x \text{ has a
run in } M_{\mathcal{K}}(S) \text{ from } I(S) \text{ to } V(S) \setminus F(S)\}$. Such a run
is accepting. The language of a plan is the set of all
action sequences (i.e., strings) allowed by the plan.

An example string in the language of process $S$,
the multiagent process that is the product of A,
B, and C, is (((A:MAVs-wait) * (B:MAVs-wait) *
(C:go-A)), ((A:MAVs-go) * (B:MAVs-wait) * (C:go-B)),
((A:MAVs-wait) * (B:MAVs-go) * (C:go-B)),
((A:MAVs-wait) * (B:MAVs-go) * (C:go-A))). This is
a sequence of atoms of $S$. An accepting run of this
string is ((WAIT, WAIT, GO-A), (GO, WAIT, GO-B),
(WAIT, GO, GO-B), (WAIT, GO, GO-A), (GO,
WAIT, GO-A)). Because $F(S) = \emptyset$, all runs beginning
in an initial state are accepting runs, and the strings
that have such runs form the language of $S$.
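
The definitions of run and language suggest a direct membership test: track the set of vertices reachable after each action and accept if some run ends in a non-final state. The sketch below assumes a process stored as a dictionary of transition conditions (sets of atoms); the toy plan is hypothetical except for the WAIT-to-GO condition quoted earlier, and the function name `accepts` is introduced here for illustration.

```python
def accepts(process, string):
    """True if `string` (a sequence of atoms) has an accepting run."""
    if not string:
        return False                      # strings contain one or more actions
    current = set(process["initial"])     # vertices reachable so far
    for atom in string:
        # x_i * M(v, w) != 0 reduces to membership because x_i is an atom.
        current = {w for (v, w), condition in process["M"].items()
                   if v in current and atom in condition}
    # Processes accept runs that end in a NON-final state.
    return any(v not in process["final"] for v in current)


x0 = ("A:MAVs-wait", "C:go-A")
x1 = ("A:MAVs-go", "C:go-B")
plan_a = {
    "M": {("WAIT", "GO"): frozenset({x0}),     # condition quoted in the text
          ("GO", "WAIT"): frozenset({x1})},    # hypothetical
    "initial": {"WAIT"},
    "final": set(),
}
print(accepts(plan_a, [x0, x1]))
print(accepts(plan_a, [x1]))
```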

3.2  TEMPORAL  LOGIC  PROPERTIES 

We  assume  properties  are  expressed  in  temporal  logic. 
For  formal  versions  of  the  definitions  here,  see  Manna 
and  Pnueli  (1991).  Linear  time  is  assumed  here.  In 
other  words,  time  proceeds  linearly  and  we  do  not 
consider  simultaneous  possible  futures.  The  type  of 
verification  used  in  this  paper  is  “model  checking.”  In 
other words, verification tests whether $S \models P$ for plan
S  and  property  P,  i.e.,  whether  plan  S  “models,”  or 
satisfies,  property  P. 


For  consistency  with  the  temporal  logic  literature,  we 
define  a  computational  state  (c-state)  as  the  action 
chosen  from  each  process  state.  Then  a  computation  is 
a  finite  sequence  of  temporally  ordered  computational 
states,  i.e.,  a  string.  To  distinguish  the  two  types  of 
states, we will refer to a process state as a p-state.

$P$ is a property true (false) for a process $S$, i.e., $S \models P$
($S \not\models P$), if and only if it is true for every string in the
language $\mathcal{L}(S)$ (false for some string in $\mathcal{L}(S)$). The
notation $x \models P$ ($x \not\models P$) means string $x$ satisfies (does
not satisfy) property $P$, i.e., the property holds (does
not hold) for $x$. Before defining what it means for
properties to be true (i.e., hold) for a string, we first
define what it means for a formula that is a Boolean
expression to be true at a c-state. A c-state formula
$p$ is true (false) at c-state $x_i$, i.e., $x_i \models p$ ($x_i \not\models p$),
if and only if $x_i \leq p$ ($x_i \not\leq p$), i.e., $x_i * p \neq 0$ ($= 0$),
because $p$ is a Boolean expression with no variables on
the same Boolean algebra used by process $S$, and $x_i$
is an atom of that algebra. For example, (A:MAVs-wait)
$\models$ ((A:MAVs-wait) + (A:no-MAVs)) for c-state
(A:MAVs-wait) and c-state formula ((A:MAVs-wait)
+ (A:no-MAVs)).

A c-state formula $p$ is true/false in particular c-states
of a string. Property $P$ is defined in terms of $p$, and
is true/false of an entire string, i.e., $x \models P$ or $x \not\models P$
for string $x$. We now define two property classes that
are among those most frequently encountered in the
verification literature for finite strings. Assume $x =
(x_0, \ldots, x_n)$ is a string of process $S$. For c-state formula
$p$ and plan $S$, define Sometimes property $P = \Diamond p$
(“Sometimes $p$”) as a property that is true for string $x$
if and only if $p$ is true in at least one c-state $x_i$ of $x$, where
$0 \leq i \leq n$. An Invariance property $P = \Box p$ (“Invariant
$p$”) is a property true for string $x$ if and only if $p$ is
true in every c-state $x_i$ of $x$.

Continuing with the MAVs example, a desirable Invariance
property $P_I$ states that “only one MAV enters
the region at a time.” This can be expressed in temporal
logic as $P_I = \Box(\neg$((A:MAVs-go) * (B:MAVs-go))).
A desirable Sometimes property $P_S$ states that “Sometimes
MAVs from swarm A enter the region.” In logic
this property is expressed as $P_S = \Diamond$(A:MAVs-go).
$P_I$, but not $P_S$, holds for the multiagent plan $S$.
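
The two property classes can be checked on a single string with a few lines of code. The sketch below represents a c-state formula as the set of joint atoms that satisfy it, and it reads “only one MAV enters the region at a time” as “A and B never go in the same step”; both the representation and that reading are assumptions made here. The example string is the one given in Section 3.1.

```python
from itertools import product

A = ("A:no-MAVs", "A:MAVs-wait", "A:MAVs-go")
B = ("B:no-MAVs", "B:MAVs-wait", "B:MAVs-go")
C = ("C:go-A", "C:go-B")
ATOMS = frozenset(product(A, B, C))


def sometimes(string, p):
    """Sometimes p: p is true in at least one c-state of the string."""
    return any(x in p for x in string)


def invariant(string, p):
    """Invariant p: p is true in every c-state of the string."""
    return all(x in p for x in string)


both_go = frozenset(t for t in ATOMS
                    if t[0] == "A:MAVs-go" and t[1] == "B:MAVs-go")
p_invariant = ATOMS - both_go          # not (A goes and B goes)
p_sometimes = frozenset(t for t in ATOMS if t[0] == "A:MAVs-go")

x = [("A:MAVs-wait", "B:MAVs-wait", "C:go-A"),
     ("A:MAVs-go", "B:MAVs-wait", "C:go-B"),
     ("A:MAVs-wait", "B:MAVs-go", "C:go-B"),
     ("A:MAVs-wait", "B:MAVs-go", "C:go-A")]

print(invariant(x, p_invariant))
print(sometimes(x, p_sometimes))
print(sometimes(x[:1], p_sometimes))   # a prefix in which A never goes
```

A property holds for the plan only if it holds for every string in the language, so a single string such as the one-step prefix above is enough to show that a Sometimes property fails.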

3.3  “SAFE”  LEARNING 

This  paper  is  concerned  with  “safe”  machine  learning 
methods  (SMLs),  i.e.,  machine  learning  operators  that 
preserve properties, also called “correctness preserving
mappings.” For plan $S$ and property $P$, suppose
verification has succeeded prior to learning, i.e., $\forall x$,
$x \in \mathcal{L}(S)$ implies $x \models P$ (i.e., $S \models P$). Then according
to Gordon (1997a), a machine learning operator $ml(S)$
is an SML if and only if verification succeeds after
learning, i.e., $\forall x$, $x \in \mathcal{L}(ml(S))$ implies $x \models ml(P)$.
Note that a machine learning operator may also affect
the property $P$, which could be undesirable. Therefore,
being an SML is not always sufficient. Additional
requirements on learning, in particular on abstraction,
are discussed next.

4  BOOLEAN  ALGEBRA 
ABSTRACTION 

Kurshan  (1994)  presents  methods  for  improving  the 
efficiency  of  automata-based  verification,  but  does  not 
consider  the  possibility  of  automata,  such  as  agents, 
that  can  learn.  By  applying  some  of  the  results  of 
Kurshan  (1994)  in  a  novel  way,  Gordon  (1997a;  1997b) 
shows  that  when  agents  learn  using  certain  abstrac¬ 
tions,  the  abstractions  are  a  priori  guaranteed  to  be 
SMLs for all property classes - but only if abstraction
is performed on both the plan and property.⁵ Therefore,
this section identifies situations in which it is
acceptable to apply an SML abstraction to a property.

The SML abstractions include very useful ones, such as
partitioning the Boolean algebra atoms (e.g., using
constructive induction) and projection, which is a form of
feature selection (or, more properly, action deletion).
Although  the  methods  described  in  this  section  apply 
to  any  of  these  abstractions,  for  illustration  we  focus 
only  on  projection,  which  is  a  mapping  from  a  Boolean 
algebra  to  a  subalgebra.  For  a  formal  definition  of  pro¬ 
jection,  see  Kurshan  (1994).  Here,  we  continue  with 
the  MAVs  example. 

Suppose all the MAVs in the B swarm leave the region.
To incorporate this knowledge, Boolean algebra
projection, a type of abstraction, projects the product
algebra $\mathcal{ABC}$ onto subalgebra $\mathcal{AC}$. Projection $proj_{\mathcal{AC}} :
\mathcal{ABC} \to \mathcal{AC}$ is defined as $proj_{\mathcal{AC}}(a * b * c) = a * c$
for atoms $a \in \Gamma(\mathcal{A})$, $b \in \Gamma(\mathcal{B})$ and $c \in \Gamma(\mathcal{C})$, and
is extended linearly to the full algebra. For example,
$proj_{\mathcal{AC}}$((A:MAVs-wait) * (B:MAVs-wait) * (C:go-A))
= (A:MAVs-wait) * (C:go-A). In addition to removing
entire subalgebras, it is also possible to remove
atoms from within a subalgebra.

Projection $proj_{\mathcal{AC}}$ removes all references to swarm B
from  the  multiagent  plan  S.  This  projection  reduces 

⁵This result applies to agents with finite or infinite
lifetimes. 


B’s  plan  to  the  trivial  plan  which  allows  B  to  do  any¬ 
thing.  We  assume  that  when  the  agent  applies  a  pro¬ 
jection  to  the  plan,  it  has  justification  to  do  so  -  be¬ 
cause  the  purpose  of  abstraction  is  to  modify  the  plan. 
Modification  of  the  property,  on  the  other  hand,  may 
be  a  side  effect  required  for  an  a  priori  guarantee  that 
the abstraction is an SML. Applying $proj_{\mathcal{AC}}$ to the
Invariance property $P_I$, which states that “only one
MAV may enter the region at a time,” results in a
property which accepts any multiagent plan of agents
A, B, and C. When applied to both plan and property,
$proj_{\mathcal{AC}}$ is an SML. Nevertheless, if the B swarm returns
to the region and is restored into the multiagent
plan,  then  this  new  property  which  allows  the  agents 
to  do  anything  could  have  disastrous,  unintended  (by 
the  user)  consequences. 
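
The effect of projection on the Invariance property can be reproduced with a short sketch. It assumes elements of the product algebra are represented as sets of joint atoms, reads “extended linearly” as taking the union of the projected atoms, and reads the Invariance formula as “A and B never go in the same step”; the function name `proj_ac` is introduced here for illustration.

```python
from itertools import product

A = ("A:no-MAVs", "A:MAVs-wait", "A:MAVs-go")
B = ("B:no-MAVs", "B:MAVs-wait", "B:MAVs-go")
C = ("C:go-A", "C:go-B")
ATOMS_ABC = frozenset(product(A, B, C))
ATOMS_AC = frozenset(product(A, C))      # the element 1 of subalgebra AC


def proj_ac(element):
    """Map each atom a*b*c to a*c, extended to sets of atoms by union."""
    return frozenset((a, c) for (a, b, c) in element)


atom = frozenset({("A:MAVs-wait", "B:MAVs-wait", "C:go-A")})
print(proj_ac(atom))

# c-state formula of the Invariance property: not (A goes and B goes).
not_both_go = frozenset(t for t in ATOMS_ABC
                        if not (t[0] == "A:MAVs-go" and t[1] == "B:MAVs-go"))
print(proj_ac(not_both_go) == ATOMS_AC)  # projected formula is the element 1
```

Under these assumptions the projected formula is satisfied by every c-state, which matches the observation that the abstracted property accepts any plan.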

This example illustrates our dilemma: If we abstract
the  property  along  with  the  plan,  the  abstraction  will 
be  guaranteed  a  priori  to  be  an  SML.  However,  by  ab¬ 
stracting  the  property,  we  risk  violating  the  user’s  orig¬ 
inal  intentions.  When  is  it  ok  to  abstract  a  property? 
There  are  at  least  three  cases  when  it  is  permissible: 

(1)  When  the  abstraction  is  property  invariant. 

Applying the projection $proj_{\mathcal{AC}}$ to the Sometimes
property $P_S$, which states that “Sometimes MAVs
from swarm A enter the region,” leaves $P_S$ invariant,
i.e., $proj_{\mathcal{AC}}(P_S) = P_S$. Therefore the abstraction is
property  invariant.  The  intuition  is  that  the  behavior 
of  agent  B  is  irrelevant  when  testing  this  property. 

In  general,  to  determine  whether  property  invariance 
holds,  an  agent  must  apply  abstraction  to  each  prop¬ 
erty  P  and  then  check  whether  P  remains  unaltered  by 
abstraction.  This  simple  syntactic  check  is  a  form  of 
incremental  re-verification  because  it  is  localized  to  a 
test on the property alone. The check has a worst case
time complexity of $O(|P|)$ for any property $P$. This
is lower than the worst case time complexity of complete
re-verification from scratch (following abstraction),
which is $O(|\Gamma(\mathcal{K})| * |P|)$ for Invariance and Sometimes
properties, where $|\Gamma(\mathcal{K})|$ is the number of atoms
in the plan (Lichtenstein & Pnueli, 1984). Furthermore,
if the agent will only accept property invariant
abstractions,  then  the  cost  of  plan  abstraction  can  be 
avoided  when  this  incremental  check  fails. 
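
For projection, one way to realize this syntactic check is a single pass over the property's formula that looks for any action of a removed agent. The sketch below is an illustration under assumptions made here: formulas are stored as nested tuples, actions are strings of the form "agent:action", and a property is treated as unaltered by projection exactly when it mentions no removed agent.

```python
def mentions(formula, agents):
    """True if the formula tree contains an action of any agent in `agents`."""
    if isinstance(formula, str):                 # leaf such as "B:MAVs-go"
        return formula.split(":", 1)[0] in agents
    _operator, *operands = formula               # e.g. ("and", f, g)
    return any(mentions(f, agents) for f in operands)


def is_property_invariant(formula, removed_agents):
    """Each node is visited at most once, so the cost is linear in |P|."""
    return not mentions(formula, removed_agents)


P_S = ("sometimes", "A:MAVs-go")
P_I = ("invariant", ("not", ("and", "A:MAVs-go", "B:MAVs-go")))

print(is_property_invariant(P_S, {"B"}))
print(is_property_invariant(P_I, {"B"}))
```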

(2)  When  the  abstraction  is  property  irrelevant. 

An  example  is  when  the  agents  discover,  or  are  told 
about,  a  permanent  change  that  henceforth  renders 
one  or  more  items  (e.g.,  an  agent  or  action)  irrelevant. 
The  term  “permanent”  in  this  context  means  a  change 
whose effects are sustained at least until the last agent
has  terminated.  Because  the  change  is  permanent,  we 
can  be  assured  that  no  problems  are  caused  by  apply¬ 
ing  an  SML  abstraction  to  the  properties. 

Consider an example in which a swarm agent becomes
irrelevant. Suppose the lives of all MAVs in
the B swarm have terminated, e.g., they become permanently
inoperative, but we wish to continue with
the multiagent plan because the other agents survived.
Then the application of $proj_{\mathcal{AC}}$ to the property $P_I$ has
no significant effect - because $P_I$ is no longer needed.

(3)  When  the  abstraction  is  property  reversible. 

Suppose  the  agents  determine  that  one  or  more  items 
are  not  relevant  to  the  objectives  of  their  multiagent 
plan,  but  this  is  a  temporary  change  in  condition,  i.e., 
the  items  may  become  relevant  again.  For  example, 
agents may disappear to attend to different tasks and then
possibly  return,  and  actions  may  become  temporar¬ 
ily  disabled  due  to  mechanical  failures.  Items  irrel¬ 
evant  to  the  multiagent  objectives  could  be  removed 
from  the  multiagent  plan  and  also  from  the  properties. 
Under  these  circumstances,  we  want  the  abstraction 
to  be  property  reversible.  An  abstraction  is  property 
reversible  if  the  pre-abstraction  property  can  be  re¬ 
stored,  e.g.,  by  saving  it.  This  way  we  can  retest  the 
original  property  after  undoing  the  effects  of  abstrac¬ 
tion. 

We  only  want  our  agents  to  perform  property  irrel¬ 
evant  and  property  reversible  abstractions  when  ab¬ 
straction  is  restricted  to  removing  irrelevant  items.  If 
agents  are  not  told  relevance,  they  may  need  to  per¬ 
form  relevance  determination,  perhaps  using  methods 
such as those of Subramanian (1988). Other research
related to the ideas in this section includes feature selection
(see http://ai.iit.nrc.ca/bibliographies/feature-selection.html)
and plan abstraction (Knoblock, 1990).

5  GENERALIZATION 

Although  we  have  been  unable  to  obtain  positive  a 
priori  results  for  generalization,  this  section  presents 
a  novel  method  for  incremental  re-verification  after 
generalization.  Efficiency  is  gained  by  tailoring  in¬ 
cremental  re-verification  methods  to  specific  prop¬ 
erty  classes.  Because  there  are  only  about  a  dozen 
property  classes  commonly  used  in  practice  (Kurshan, 
1994), this seems reasonable to do. The re-verification
method presented in this section is specific to Invariance
properties. A method for Sometimes properties may be found
in Gordon (1997b). Methods for other
property  classes  are  currently  being  investigated. 

Generalization  differs  from  abstraction  in  that  you  are 
not  changing  the  entire  Boolean  algebra  (e.g.,  taking 
a  subalgebra)  but  instead  you  are  increasing  the  gen¬ 
erality  of  a  transition  condition  labeling  one  or  more 
edges  (for  simplicity,  here  we  consider  one).  Gener¬ 
alization  is  done  when  the  agent  discovers  that  the 
transition  can/should  be  taken  under  a  larger  set  of 
circumstances.  It  is  only  done  to  the  plan.  In  the 
context  of  a  process,  generalization  raises  the  level  of 
a  particular  p-state-to-p-state  transition  condition  in 
the  partial  order  whereas  specialization  lowers  it, 
e.g.,  as  in  Mitchell’s  Version  Spaces  (Mitchell,  1978). 

Formally, we define generalization of the condition
along edge $(v,w)$ as follows. Generalization operator
$ml_{gen} : S \to S'$, where both $S$ and $S'$ use Boolean
algebra $\mathcal{K}$, is defined as
$ml_{gen} : M_{\mathcal{K}}(S) \to M_{\mathcal{K}}(S')$, where
$ml_{gen}(M_{\mathcal{K}}(v,w)) = M_{\mathcal{K}}(v,w) + z$,
for some $z \in \mathcal{K}$.^6

An example of generalization is the following. The
transition condition associated with the edge ((WAIT,
WAIT, GO-A), (GO, WAIT, GO-B)) in the multiagent
plan S is (A:MAVs-wait) * (B:MAVs-wait) * (C:go-A).
This could be generalized to
((A:MAVs-wait) * (B:MAVs-wait) * (C:go-A)) +
((A:MAVs-wait) * ¬(B:MAVs-wait) * (C:go-A)),
i.e., (A:MAVs-wait) * (C:go-A) for new plan S'.
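
The simplification in this example can be checked mechanically. The following Python sketch is an illustration only, not part of the original method: it assumes three Boolean variables (the names `a_wait`, `b_wait`, and `c_go_a` are hypothetical stand-ins for A:MAVs-wait, B:MAVs-wait, and C:go-A), treats every truth assignment as one atom, and models generalization as the sum (set union) of two extensionally represented conditions.

```python
from itertools import product

# Hypothetical variable names standing in for the three agent conditions.
VARS = ("a_wait", "b_wait", "c_go_a")


def atoms_of(cond):
    """Return the set of atoms (truth assignments) on which cond holds."""
    result = set()
    for values in product((False, True), repeat=len(VARS)):
        if cond(dict(zip(VARS, values))):
            result.add(values)
    return frozenset(result)


def generalize(y, z):
    """Generalization of an edge condition: the sum y + z, i.e., set union."""
    return y | z


# Original condition y and the added condition z, as sets of atoms.
y = atoms_of(lambda s: s["a_wait"] and s["b_wait"] and s["c_go_a"])
z = atoms_of(lambda s: s["a_wait"] and not s["b_wait"] and s["c_go_a"])

# The condition claimed to be equivalent to y + z.
simplified = atoms_of(lambda s: s["a_wait"] and s["c_go_a"])
new_condition = generalize(y, z)
```

Under these assumptions `new_condition` and `simplified` contain the same two atoms, which is the extensional counterpart of dropping the B:MAVs-wait conjunct.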

To illustrate our incremental approach, recall S satisfies
the Invariance property P1, which states that "only
one MAV enters the region at a time," i.e.,
□ ¬((A:MAVs-go) * (B:MAVs-go)). We could check this
property against the entire new plan S', but a preferable
alternative is to simply check it against the new
addition to the transition condition, namely, is P1 satisfied
by (A:MAVs-wait) * ¬(B:MAVs-wait) * (C:go-A)?
In fact it is, because (A:MAVs-go) is not true, and
that is all we need to know to be sure that the $ml_{gen}$
just applied is an SML. We can now formalize this.

Let us consider the Invariance property $P = \Box\, p$ for
c-state formula $p$. Let $y$ be the existing transition
condition for edge $(v,w)$ in plan $S$, i.e., $M_{\mathcal{K}}(v,w) = y$. We
previously defined what it means for a c-state formula
$p$ to be true at a c-state, but it is also useful to define
what it means for a c-state formula to be true of
a transition condition. Let
$\Gamma(\mathcal{K})_y = \{a \mid a \in \Gamma(\mathcal{K})$ and $a \leq y\}$.
A c-state formula $p$ is defined to be true
of a transition condition $y$, i.e., $y \models p$, if and only if
$\forall a \in \Gamma(\mathcal{K})_y$, $a \leq p$.


differs from $S$ only by the results of $ml_{gen}$.
Assume every string $x$ in $\mathcal{L}(S)$ satisfies Invariance
property $P$, so for each $x$, $p$ is true of every atom
in $x$. This implies $y \models p$.^7 Now we generalize the
edge to form $S'$ via $ml_{gen}(M_{\mathcal{K}}(v,w)) = y + z$.

This operator $ml_{gen}$ is an SML with respect to Invariance
property $P$ if and only if $S' \models P$, which is true
if and only if $z \models p$. The reason for this is that we
know $S$ satisfies $P$ from our original verification, and
therefore $p$ is true for all atoms in all strings in $\mathcal{L}(S)$.
The only new atoms in $\mathcal{L}(S')$ but not in $\mathcal{L}(S)$ are in
$\Gamma(\mathcal{K})_z$. Therefore, if $z \models p$, then $p$ is true for all atoms
in $\mathcal{L}(S')$, which implies every string in $\mathcal{L}(S')$ satisfies
$P$, i.e., $S' \models P$. Therefore, re-verification need only
test whether $z \models p$, i.e., $\forall a \in \Gamma(\mathcal{K})_z$, $a \leq p$.
(We assume transition conditions are represented extensionally,
i.e., as the unique sum of atoms equivalent to the
Boolean expression.) If $z \not\models p$, $S' \not\models P$.^8 This test
is incremental because it is localized to just checking
whether the property holds of the newly added atoms
in $z$, rather than all atoms in $\mathcal{L}(S')$.
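
The localized test can be sketched in a few lines. The Python fragment below is an illustration rather than the authors' implementation: it assumes that transition conditions and c-state formulas are both stored extensionally as sets of atoms, so that a ≤ p becomes set membership and y ⊨ p becomes a subset test. The function names and the five-atom universe are hypothetical.

```python
def holds_of(condition_atoms, property_atoms):
    """y |= p: every atom below the condition y is also below p."""
    return condition_atoms <= property_atoms


def reverify_incremental(z_atoms, p_atoms):
    """After generalizing an edge from y to y + z, test only the atoms of z."""
    return holds_of(z_atoms, p_atoms)


# Assumed universe of atoms a..e. The formula p = "not e" is the sum of
# every atom except e.
ALL_ATOMS = frozenset("abcde")
p = ALL_ATOMS - {"e"}

y = frozenset("a")           # existing edge condition, already verified
y_ok = holds_of(y, p)

ok = reverify_incremental(frozenset("c"), p)    # adding atom c is safe
bad = reverify_incremental(frozenset("e"), p)   # adding atom e violates p
```

Only the atoms of the addition are examined; the atoms reached by the rest of the plan are never revisited.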

For example, suppose a, b, c, d, and e are atoms, and
the transition condition $y$ between $v$ and $w$ equals a.
Let (a, b, b, d) be an accepting string of $S$ that includes
$v$ and $w$ as the first two vertices in its accepting
run. The property is $P = \Box\, \neg e$. Assume the fact that
this string satisfies $\neg e$ was proved in the original
verification. Suppose $ml_{gen}$ generalizes $M_{\mathcal{K}}(v,w)$ from a
to (a + c), which adds the string (c, b, b, d) to $\mathcal{L}(S')$.
Then rather than test whether the elements of {a, b,
c, d} are $\leq \neg e$, all we really need to test is whether
$c \leq \neg e$, because c is the only newly added atom.

By storing and reusing knowledge from previous verification(s),
we can increase the efficiency of this test.
Suppose some atoms $a$ such that $a \leq z$ were tested
for $a \leq p$ during previous verification(s), and the outcomes
of these tests were stored. Then lookup will
suffice, and the only atoms in $\Gamma(\mathcal{K})_z$ that need to be
tested against $p$ during the current re-verification are
those not previously tested.
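
One way to picture this reuse is a lookup table from atoms to stored outcomes. The sketch below is an assumption-laden illustration, not the paper's data structure: `cache` is a plain dictionary, `test_atom` is a hypothetical callback that performs the check a ≤ p, and the function also reports how many fresh tests were needed.

```python
def reverify_with_cache(z_atoms, test_atom, cache):
    """Check every atom of z against p, reusing stored outcomes.

    test_atom(a) performs the (possibly costly) check a <= p.
    cache maps atoms to outcomes stored by earlier verification runs.
    Returns (all_atoms_satisfy_p, number_of_fresh_tests).
    """
    fresh_tests = 0
    for atom in sorted(z_atoms):
        if atom not in cache:
            cache[atom] = test_atom(atom)   # test once, then remember
            fresh_tests += 1
        if not cache[atom]:
            return False, fresh_tests
    return True, fresh_tests


# Outcomes for atoms a and b are assumed to be stored already.
stored = {"a": True, "b": True}
result, fresh = reverify_with_cache({"a", "c"}, lambda atom: atom != "e", stored)
```

In this toy run only atom c is tested; the outcome for atom a comes from the stored table.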

What cost benefit(s) does incremental re-verification
have over complete re-verification from scratch? Verification,
or complete re-verification from scratch, in
the worst case has time complexity $O(|\Gamma(\mathcal{K})| * |p|)$ for
Invariance properties, where $|\Gamma(\mathcal{K})|$ is the total number
of atoms, and $|p|$ is the length of the c-state formula
$p$ (Lichtenstein & Pnueli, 1984). This is because the
c-state formula may have to be tested in every unique
c-state, which is an atom. $|\Gamma(\mathcal{K})|$ is exponential in
the number of single agent plans forming a multiagent
plan. In the worst case, incremental re-verification
has the same time complexity, but this would be a
very bizarre situation indeed. It would require that no
atoms were tested against the property in the original
verification (which could occur if $\mathcal{L}(S)$ were empty),
and all atoms are added to the transition condition
during generalization, i.e., $\forall a \in \Gamma(\mathcal{K})$, $a \leq z$.

^7 This statement is based on our assumption that $(v,w)$
is part of an accepting run for at least one $x \in \mathcal{L}(S)$. This
assumption motivates re-verification.

^8 That is, unless $(v,w)$ is not part of any accepting run,
but then the test is unnecessary.

Let us consider a more realistic comparison. The worst
case time complexity for complete re-verification assumes
all c-states are reachable from some initial p-state.
This may not be true, e.g., the number of initial
p-states might be very small. Re-verification is
required to determine $\forall x \in \mathcal{L}(S')$ whether $x \models P$. At
the very least, complete re-verification of an Invariance
property $P = \Box\, p$ must test whether $x_i \models p$ $\forall x_i$ in $x$,
$\forall x \in \mathcal{L}(S')$. The complexity of this test is
$C_{complete} = O(|\Gamma(\mathcal{K})_{\mathcal{L}(S')}| * |p|)$, where
$|\Gamma(\mathcal{K})_{\mathcal{L}(S')}|$ is the number
of unique atoms in all strings $x \in \mathcal{L}(S')$.

A more realistic cost estimate for incremental re-verification
is $C_{increm} = O(|\Gamma(\mathcal{K})_{s(z)}| + (|\Gamma(\mathcal{K})_{ns(z)}| * |p|))$,
where $\Gamma(\mathcal{K})_{s(z)}$ ($\Gamma(\mathcal{K})_{ns(z)}$) contains atoms whose
results are (are not) previously stored. The first addend
is the cost of lookup of results from previous verification(s),
and the second addend is the cost added
by testing the atoms that were not previously tested.
Whenever generalization is reasonably conservative,
i.e., $|\Gamma(\mathcal{K})_z| \ll |\Gamma(\mathcal{K})_{\mathcal{L}(S')}|$, incremental re-verification
can provide considerable savings over complete re-verification.
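
To get a feel for the two estimates, the following Python sketch evaluates them with constant factors dropped. All numbers are made up for illustration, and counting one unit of work per lookup is an assumption; the function and parameter names are hypothetical.

```python
def cost_complete(n_atoms_in_language, formula_length):
    """Complete re-verification: test p on every unique atom of L(S')."""
    return n_atoms_in_language * formula_length


def cost_incremental(n_stored, n_not_stored, formula_length):
    """Incremental re-verification: look up stored atoms, test the rest."""
    return n_stored + n_not_stored * formula_length


# Made-up sizes: 10,000 unique atoms in L(S'), a formula of length 20,
# and a generalization z with 35 atoms, of which 30 have stored outcomes.
complete = cost_complete(10_000, 20)
incremental = cost_incremental(30, 5, 20)
```

With these illustrative sizes the incremental estimate is 130 units against 200,000 for complete re-verification; the gap shrinks as the generalization z grows toward the full set of atoms.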

6  DISCUSSION 

Here  we  have  addressed  the  question  of  how  agents  can 
adapt  (learn)  safely,  i.e.,  by  preserving  critical  prop¬ 
erties,  and  how  they  can  do  this  in  a  time-efficient 
manner.  We  extended  the  work  of  Gordon  (1997a)  to 
obtain  positive  results  for  two  popular  machine  learn¬ 
ing  methods:  abstraction  and  generalization.  For  ab¬ 
straction  to  be  a  priori  safe  (property-preserving),  the 
property  must  also  be  abstracted.  This  paper  enu¬ 
merates  situations  in  which  it  is  permissible  to  ab¬ 
stract  the  property.  Furthermore,  novel  incremental 
re-verification  methods  are  presented  for  abstraction 
and  generalization.  These  methods  have  the  potential 
to  provide  large  computational  savings  over  complete 
re-verification  from  scratch.  With  our  methods  (in¬ 
cluding  a  priori),  agents  can  use  abstraction  and  gen¬ 
eralization  to  adapt  to  novel  situations,  and  can  do  so 
with  quick  checks  that  ensure  the  reliability  of  their 
behavior.

There  is  a  small  amount  of  prior  research  on  incre¬ 
mental  re-verification.  Reps  and  Teitelbaum  (1989) 
developed  a  verifier  for  users  to  check  their  code  while 
writing  in  traditional  programming  languages,  such  as 
PL/I.  Their  verifier  can  incrementally  re-check  soft¬ 
ware  after  edits  using  Hoare-style  proofs.  However, 
unlike  our  re-verification  methods,  these  proofs  require 
some  interaction  with  the  user.  Sokolsky  and  Smolka 
(1994)  have  an  incremental  method  for  verifying  added 
or  deleted  state  transitions  in  an  automaton-like  repre¬ 
sentation.  However  they  do  not  address  generalization 
or  abstraction.  Finally,  Weld  and  Etzioni  (1994)  have 
a  method  to  incrementally  test  an  agent’s  plan  to  de¬ 
cide  whether  to  add  new  actions  to  the  plan.  There 
are  certain  similarities  between  our  work  and  that  of 
Weld  and  Etzioni.  They  add  actions  to  a  plan  only 
when their effects do not violate dont-disturb properties,
which are a type of Invariance property. Our generalization
also adds actions to a plan. Furthermore,
both  approaches  localize  verification.  The  main  differ¬ 
ences  are  that  unlike  Weld  and  Etzioni,  we:  (1)  use 
a  formal  foundation  based  on  the  verification  litera¬ 
ture,  in  particular,  model-checking  and  automata,  (2) 
assume  the  existence  of  prior  verification  knowledge 
and  use  this  knowledge  to  streamline  re-verification,
(3)  use  reactive  rather  than  necessarily  goal-oriented 
plans,  and  (4)  address  abstraction. 

One  aspect  of  Weld  and  Etzioni  (1994)  that  was  pur¬ 
posely  not  addressed  here  is  that  of  how  to  select 
which  method  to  use  in  repairing  a  plan.  This  is  a 
rich  issue  for  future  research,  and  could  draw  on  cost- 
effective  methods  such  as  those  of  Joslin  and  Pollack 
(1994).  Rather  than  repair,  this  paper  focuses  on  re¬ 
verification.  We  are  unaware  of  any  methods  besides 
ours  for  incrementally  re-verifying  abstraction  or  gen¬ 
eralization  in  automata.  Much  more  work  remains 
to  be  done  on  the  important  topic  of  incremental  re¬ 
verification  -  especially  for  adaptive  agents. 
Abstract

In  this  paper,  we  propose  a  machine-learning 
solution  to  problems  consisting  of  many  sim¬ 
ilar  prediction  tasks.  Each  of  the  individual 
tasks  has  a  high  risk  of  overfitting.  We  com¬ 
bine  two  types  of  knowledge  transfer  between 
tasks  to  reduce  this  risk:  multi-task  learning 
and  hierarchical  Bayesian  modeling.  Multi¬ 
task  learning  is  based  on  the  assumption  that 
there  exist  features  typical  to  the  task  at 
hand.  To  find  these  features,  we  train  a  huge 
two-layered  neural  network.  Each  task  has 
its  own  output,  but  shares  the  weights  from 
the  input  to  the  hidden  units  with  all  other 
tasks.  In  this  way  a  relatively  large  set  of 
possible  explanatory  variables  (the  network 
inputs)  is  reduced  to  a  smaller  and  easier 
to  handle  set  of  features  (the  hidden  units). 
Given  this  set  of  features  and  after  an  appro¬ 
priate  scale  transformation,  we  assume  that 
the  tasks  are  exchangeable.  This  assumption 
allows  for  a  hierarchical  Bayesian  analysis  in 
which  the  hyperparameters  can  be  estimated 
from  the  data.  Effectively,  these  hyperpa¬ 
rameters  act  as  regularizers  and  prevent  over¬ 
fitting.  We  describe  how  to  make  the  system 
robust  against  nonstationarities  in  the  time 
series  and  give  directions  for  further  improve¬ 
ment.  We  illustrate  our  ideas  on  a  database 
regarding  the  prediction  of  newspaper  sales. 

1  INTRODUCTION 

1.1  PROBLEM  DESCRIPTION 

In  this  paper,  we  focus  on  problems  such  as 


•  efficient  distribution  of  newspapers  and  maga¬ 
zines; 

•  predicting  gas  consumption  of  different  compa¬ 
nies; 

•  analyzing  sales  figures  of  many  company 
branches; 

•  optimizing  stock  selection  and  portfolio  manage¬ 
ment. 

The  main  characteristic  of  each  of  these  problems  is 
that  they  are  in  fact  composed  of  many  similar  predic¬ 
tion  tasks.  These  individual  tasks  usually  have  a  low 
signal-to-noise  ratio:  in  some  cases  one  would  be  happy 
if  one  could  explain  10  percent  of  the  variance  in  the 
data.  Because  of  the  large  number  of  different  tasks,
any  performance  improvement  is  almost  immediately 
significant,  both  financially  and  statistically.  Further¬ 
more,  in  most  cases  one  can  easily  come  up  with  quite 
a  few  (possibly)  explanatory  variables.  For  example, 
in  predicting  sales  figures,  one  may  want  to  include 
some  of  the  recent  sales  figures,  sales  figures  from  the 
same  period  last  year,  sales  figures  from  other  com¬ 
panies,  different  kinds  of  weather  information,  and  so 
on.  Overfitting  then  becomes  a  major  concern.  The 
question  addressed  in  this  paper  is  therefore:  how  can 
we  exploit  the  benefit  of  not  having  a  single  predic¬ 
tion  task  but  a  whole  set  of  seemingly  similar  tasks, 
such  that  we  can  reduce  the  risk  of  overfitting  in  a 
computationally  feasible  way? 

We  propose  to  combine  two  approaches:  multi¬ 
task  learning,  suggested  in  the  neural-network 
and  machine-learning  community,  and  hierarchical 
Bayesian  modeling,  developed  in  the  statistics  com¬ 
munity.  Multi-task  learning  is  treated  in  Section  2. 
The  idea  is  that  tasks  can  learn  from  each  other  by 
sharing  the  same  features.  The  underlying  assump¬ 
tion  is  that  such  features,  typical  to  the  task  at  hand, 
indeed  exist.  Hierarchical  Bayesian  modeling  applies 
when  one  can  rely  on  the  assumption  that  a  priori,
i.e.,  before  taking  into  account  the  data  itself,  there  is 
no  information  to  distinguish  the  model  parameters  of 
any  one  task  from  those  of  any  of  the  other  tasks.  We 
will  describe  hierarchical  modeling  in  Section  3. 

We will illustrate our ideas on a database concerning
the prediction of newspaper sales. This database
consists of several years of weekly sales figures for a
set of 343 different points of sale. Each point of sale
represents a different time-series prediction task. In
Section 1.2 we first discuss how to make the tasks
"sufficiently similar", i.e., such that we can apply the
approach proposed in Sections 2 and 3. Although our
examples include collections of time-series tasks, our
analysis in these sections is completely static. In
Section 4.1 we therefore describe a first crude attempt to
handle nonstationarities in the data. Section 4 further
links the different components together, recapitulates
the assumptions and discusses directions for further
improvements.

1.2  MAKING  TASKS  SIMILAR 

The underlying assumption of both the multi-task
learning approach and the hierarchical Bayesian approach
is that the different tasks can be considered
similar. This is not always immediately obvious. As
can be seen for example from Figure 1, where we plotted
the average sales of 343 newspaper points of sale
versus their standard deviation, the typical number of
single copies sold at each outlet ranges from just a
few to a few hundred. Still we want to assume that
the tasks are, in some sense, exchangeable. In
Section 2 this implies that sales figures, when used as
explanatory variables, should have more or less the
same meaning: 20 newspapers may be quite a lot for a
small outlet, but are well below average for a large outlet.
Similar reasoning applies to the scaling of model
parameters in our choice of prior distributions in
Section 3. In the newspaper example, our working
hypothesis will be that the points of sale are exchangeable,
after correcting for their typical scale.

Such a correction can be accomplished by normalizing
the sales figures for each outlet separately. The strong
correlation between the average sales and the noise
level in Figure 1 ($R^2 = 0.90$ on the logarithmic scale)
suggests that we can represent the typical scale of each
individual outlet through just one parameter $\theta_i$, denoting
the average sales of outlet $i$. We can correct for
this typical scale by normalizing all sales figures using
this average and the fitted standard deviation as given
by the dashed line in Figure 1.
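
A minimal sketch of such a normalization is given below. It is an illustration under stated assumptions rather than the procedure actually used for the newspaper data: the text does not spell out the exact transformation, so the sketch assumes that each figure is centered on the outlet average and divided by the standard deviation predicted by the log-log least squares fit. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_log_log(avgs, stds):
    """Least squares fit of log(std) = alpha + beta * log(avg)."""
    xs = [math.log(a) for a in avgs]
    ys = [math.log(s) for s in stds]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    return alpha, beta


def normalize_outlets(sales_by_outlet):
    """Rescale each outlet's series with its average and the fitted std."""
    avgs = [mean(series) for series in sales_by_outlet]
    stds = [pstdev(series) for series in sales_by_outlet]
    alpha, beta = fit_log_log(avgs, stds)
    normalized = []
    for series, theta in zip(sales_by_outlet, avgs):
        # Standard deviation read off the fitted line, not the raw one.
        fitted_std = math.exp(alpha + beta * math.log(theta))
        normalized.append([(s - theta) / fitted_std for s in series])
    return normalized


# Toy data (made up): three outlets of very different size.
toy = [[4, 6, 5, 7, 3], [40, 52, 47, 61, 50], [400, 430, 380, 410, 455]]
scaled = normalize_outlets(toy)
```

Using the fitted rather than the raw standard deviation means each outlet is described by the single scale parameter, its average sales.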


Figure 1: Average newspaper sales $\theta_i$ versus the corresponding
standard deviation for 343 different points
of sale. The dashed line is the least squares fit of the
logarithm of the standard deviation as a function of
the logarithm of the average sales.

2  MULTI-TASK  LEARNING 

2.1  ARGUMENTATION 

We want to build and train a model relating a set of
explanatory variables $x$ to an output $z$. First we have
to choose which explanatory variables to include in
such a model. Typically, it is easy to come up with on
the order of $n_{\text{inputs}} \approx 20$ input variables (see for example
Table 1, where we describe the explanatory variables
incorporated in our newspaper example). With
on the order of a hundred training patterns per task
and a low signal-to-noise ratio, any attempt to fit a
direct model between the input variables and the targets
corresponding to a single task is doomed to lead
to overfitting and thus poor prediction performance.

We need some preprocessing stage transforming the
$n_{\text{inputs}}$ input variables $x$ into a small set of, say,
$n_{\text{features}} \approx 3$ features $y$, typical to the task at hand. In
practice, one often tries to find these features through
an iterative process of thinking and testing (see also
Figure 4). For example, one tries several ways of combining
the most recent sales figures into a single number,
tests each of them, and takes the best. Here we
propose to learn this transformation. We combine all
tasks into one big network (see Figure 2). The input
units are connected to the hidden (feature) units
through a weight matrix $B$. The weight vector connecting
the hidden units to the output unit corresponding
to task $i$ is denoted $A_i$. In other words, all
tasks share the weight matrix $B$, but have independent
weight vectors $A_i$.

Figure 2: Typical network structure: a reasonably
large number of input units, a small number of hidden
units, and a huge number of output units.

In this paper, we will consider the case of linear hidden
units. Given an input vector $x$, the features and
outputs are then computed through

$$y_j = \sum_k B_{jk} x_k, \qquad z_i = A_{i0} + \sum_j A_{ij} y_j, \tag{1}$$

where $z_i$ refers to the output corresponding to task
$i$. We will use $A_i$ to denote the set of all hidden-to-output
weights specific to each task, i.e.,
$A_i = \{A_{i0}, \ldots, A_{i n_{\text{features}}}\}$. We refer to $A_i$ and $\sigma_i$ as the
set of model parameters of task $i$.
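
The shared-feature computation can be sketched as follows. This is an illustrative Python fragment under assumptions, not the original code: the index order `B[j][k]` (feature j, input k), the task names, and all weight values are made up, and `A_i[0]` is taken to hold the output bias $A_{i0}$.

```python
def features(B, x):
    """y_j = sum_k B[j][k] * x[k] (linear hidden units, no hidden bias)."""
    return [sum(b_jk * x_k for b_jk, x_k in zip(row, x)) for row in B]


def output(A_i, y):
    """z_i = A_i[0] + sum_j A_i[j] * y_j, where A_i[0] is the output bias."""
    return A_i[0] + sum(a * y_j for a, y_j in zip(A_i[1:], y))


# Toy sizes: 3 inputs, 2 shared features, 2 tasks (all values made up).
B = [[0.5, 0.0, 0.5],
     [0.0, 1.0, -1.0]]
A = {"outlet_1": [1.0, 2.0, 0.0],
     "outlet_2": [0.0, 1.0, 1.0]}

x = [2.0, 1.0, 4.0]
y = features(B, x)                                   # shared by all tasks
z = {task: output(A_i, y) for task, A_i in A.items()}
```

The features `y` are computed once and reused by every task; only the short vector `A_i` differs between outputs, which is where the reduction in free parameters per task comes from.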

The  inputs  x  can  be  divided  into  two  categories:  those 
with  equal  input  values  across  all  tasks  and  those  with 
input  values  specific  to  a  particular  task.  Nonspecific 
inputs  in  the  newspaper  example  (see  Table  1)  are  e.g. 
seasonal  variables  and  weather  figures  (we  considered 
the  “average”  weather  across  The  Netherlands  instead 
of  more  local  weather  figures).  The  specific  inputs 
should  have  more  or  less  the  same  meaning  across  all 
tasks.  This  is  accomplished  by  the  transformation  of 
the  sales  figures  described  and  discussed  in  Section  1.2. 
We  will  use  Xi  to  denote  the  set  of  inputs  correspond¬ 
ing  to  task  i. 

Hidden units do not have bias units: it is easy to see
that these can be scaled away into the bias of the output
units. We further assume a Gaussian noise model
with standard deviation $\sigma_i$, which is different for each
task, but independent of the inputs $x_i$. Assuming that
the targets $D_i = \{t_i^\mu\}$ are independently and identically
distributed (iid) given the inputs $I_i = \{x_i^\mu\}$,
model parameters $A_i$ and $\sigma_i$ and feature matrix $B$, we
can compute the probability of observing these targets
through

$$P(D_i|I_i,A_i,\sigma_i,B) \propto \exp\left[-E(A_i,\sigma_i,B|D_i,I_i)\right], \tag{2}$$

where we have defined the error

$$E(A_i,\sigma_i,B|D_i,I_i) = \sum_\mu \left[\frac{(t_i^\mu - z_i^\mu)^2}{2\sigma_i^2} + \log \sigma_i\right], \tag{3}$$

with the output $z_i^\mu$ computed as in (1). For notational
convenience we will from now on leave out the explicit
dependency on the inputs $I_i$. The iid assumption may
be too strong for time-series prediction tasks. We will
come back to that in Section 4.1.
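
As a small illustration of the error for a single task, the Python sketch below evaluates the Gaussian negative log-likelihood up to an additive constant, i.e., the sum over patterns of the squared residual divided by twice the noise variance plus the logarithm of the noise level. It also computes the noise level that minimizes this error for fixed outputs, which follows from setting the derivative with respect to the noise level to zero. The function names and the toy numbers are hypothetical.

```python
import math


def task_error(targets, outputs, sigma):
    """Gaussian negative log-likelihood of one task, up to a constant."""
    return sum((t - z) ** 2 / (2.0 * sigma ** 2) + math.log(sigma)
               for t, z in zip(targets, outputs))


def sigma_ml(targets, outputs):
    """Noise level minimizing task_error for fixed outputs: RMS residual."""
    n = len(targets)
    return math.sqrt(sum((t - z) ** 2 for t, z in zip(targets, outputs)) / n)


# Toy targets and model outputs (made up).
targets = [1.0, 2.0, 3.0]
outputs = [1.0, 2.0, 5.0]
best_sigma = sigma_ml(targets, outputs)
best_error = task_error(targets, outputs, best_sigma)
```

Because the noise level enters the error explicitly, a task with noisy targets contributes less per unit of squared residual than a task with clean targets.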


We propose to find an appropriate feature matrix $B$
through a maximum likelihood procedure: we minimize
the error (3), averaged over all $n_{\text{tasks}}$ tasks, and
obtain the maximum likelihood solutions $B^{\text{ML}}$, $A_i^{\text{ML}}$,
and $\sigma_i^{\text{ML}}$.


2.2  SIMILAR  IDEAS 

There  has  been  quite  a  lot  of  interesting  research  in 
the  area  of  inductive  transfer,  yielding  both  empirical
and  theoretical  evidence  that  multi-task  learning  im¬ 
proves  performance  (see  [10]  for  collections  of  papers 
on  multi-task  learning).  In  [1]  the  advantage  of  com¬ 
bining  several  tasks  is  investigated  theoretically,  under 
the  assumption  that  a  feature  matrix  B  common  to  all 
tasks  indeed  exists. 

In  most  approaches  to  multi-task  learning  (see  e.g.  [2] 
and  references  therein),  all  tasks  receive  the  same  in¬ 
put  information,  i.e.,  all  inputs  are  nonspecific.  As 
in  our  case,  the  different  tasks  are  forced  to  share 
the  same  hidden  unit  representation.  Often,  but  not 
always, this leads to a better generalization performance [2].
The problems considered in the literature
are mostly artificial and combine on the order
of 10 or fewer tasks. An exception is [7], where different
tasks  concerning  stock  selection  and  portfolio  manage¬ 
ment  are  combined  in  various  ways.  This  experimental 
study  is  probably  closest  in  spirit  to  our  multi-tasking 
approach,  but  its  number  of  tasks  (36)  is  still  much 
smaller  than  the  343  real-world  tasks  that  we  use  in 
our  simulation. 
Group                #   Type        B1     B2     B3
last year sales      3   specific     0.8   54.1    2.9
last year sellouts   3   specific     0.6    1.0    3.6
recent sales         5   specific    93.5   15.4    0.4
recent sellouts      5   specific     1.9    2.9   10.6
weather figures      5   nonspec.     1.2   15.8   15.0
season variables     2   nonspec.     1.9   10.7   67.6

Table 1: List of input variables (see text for further
explanation) on the left-hand side. Numbers on the
right-hand side give the percentage of variance of the
features explained by a particular group of input variables.

2.3  FEATURES  FOR  THE  PREDICTION
OF  NEWSPAPER  SALES

The explanatory variables that we took into account
are summarized in Table 1. We normalized all nonspecific
variables. Sales figures were rescaled for each
outlet separately as described in Section 1.2. Sellout
figures were not rescaled: a sellout is represented by 1,
a non-sellout by 0. Recent figures start from 4 weeks
ago (the time it takes to collect and administrate all
sales figures) and end at 8 weeks ago. Figures from
last year are from exactly the same week and the weeks
just before and after that. Weather information includes
temperature (relative to the average temperature
at the time of year), wind velocity, percentage
sunshine, and precipitation (both amount and duration).
We slightly changed the definition of the probability
model (2) and error (3) to incorporate sellouts
(number of sold copies equal to the number of delivered
copies) and to take into account that newspaper
sales are always integers.

We trained networks with $n_{\text{features}} = 1$ to 8 hidden
units. The percentages in Table 1 indicate what part
of the variance in each of the features is explained by
a particular group of input variables for $n_{\text{features}} = 3$.
The features are ordered from most to least relevant.
The first feature strongly focuses on the recent sales,
the second mostly on the sales from last year, the third
mostly on the seasonal variables. Sellouts and weather
figures seem to play a minor role, although especially
the weather figures explain some of the variance of the
second and third feature.

We can also compute the variance in the outputs explained
by each group of input variables. The circles
in Figure 3 show these percentages for different
numbers of hidden units. With any number of hidden
units, the recent sales figures come out to be most relevant.
There are, however, interesting differences: a
remarkable increase in the relevance of seasonal variables
when going from one to two hidden units, a similar
increase in the relevance of the recent sellouts when
going from two to three hidden units, and somewhat less
dramatic increases in last year's figures and weather
information.


3  HIERARCHICAL  BAYES 

3.1  BAYESIAN  MODELING 

In this section, we replace the maximum likelihood
approach of the previous section by a Bayesian approach.
We will focus on a Bayesian inference of the model
parameters $A_{ij}$ and standard deviations $\sigma_i$, given the
feature matrix $B^{\text{ML}}$ obtained in the previous section.
The underlying assumption is that, if there indeed exist
features typical to the task of predicting newspaper
sales, it should not matter too much whether we find
these through a Bayesian approach, which is computationally
infeasible in this context, or through a much simpler
maximum likelihood procedure. Furthermore, we are
making lots of other assumptions: our choice of possible
explanatory variables, the number of hidden units,
the linear transfer function and thus the restriction to
linear relationships, and so on. Each set of assumptions
corresponds to a different model or hypothesis
$\mathcal{H}$. We can simply include $B^{\text{ML}}$ in our definition of
$\mathcal{H}$. In the following, all probability distributions are
conditioned on this $\mathcal{H}$. We will omit this explicit
dependency from our notation.

Equation (2) gives the probability distribution of the
data for a single task given its model parameters. The
probability distribution of all data follows from

$$P(\mathcal{D}|A) = \prod_i P(D_i|A_i),$$

where $A_i$ now stands for all model parameters of
task $i$ (including the standard deviation $\sigma_i$),
$A = \{A_1, \ldots, A_{n_{\text{tasks}}}\}$, and $\mathcal{D} = \{D_1, \ldots, D_{n_{\text{tasks}}}\}$. In a
Bayesian analysis, we infer the probability of the model
parameters given the data using Bayes' rule:

$$P(A|\mathcal{D}) = \frac{P(\mathcal{D}|A)\,P(A)}{P(\mathcal{D})}, \tag{4}$$

where $P(\mathcal{D})$ is a normalization factor independent of
the model parameters and $P(A)$ is a prior distribution
of the model parameters.
Figure 3: Percentage of variance explained by each group of input variables for various numbers of hidden units:
maximum likelihood solution (circles, dashed lines) and most probable solutions (crosses, solid lines).
Panels: last year's sales, recent sales, weather figures, last year's sellouts, recent sellouts, seasonal variables.


We take a Gaussian prior on the model parameters
$A_i = \{A_{i0}, \ldots, A_{i,n_{\text{features}}+1}\}$ with
$A_{i,n_{\text{features}}+1} = \log \sigma_i$:

$$P(A_i|\Lambda) \propto \exp\left[-\tfrac{1}{2}(A_i - m)^T \Sigma\, (A_i - m)\right],$$

where $\Lambda = \{\Sigma, m\}$ is called a set of hyperparameters
with $\Sigma$ an $[(n_{\text{features}} + 2) \times (n_{\text{features}} + 2)]$-dimensional
symmetric matrix and $m$ an $(n_{\text{features}} + 2)$-dimensional
vector. The model parameters of each task are assumed
to be exchangeable, i.e.,

$$P(A|\Lambda) = \prod_i P(A_i|\Lambda).$$

This exchangeability assumption can be compared
with the iid assumption in (2). It implies that, prior
to the arrival of data, the probability distribution of
the model parameters is invariant under renumbering
of the tasks. This is not directly obvious, but may be
a reasonable assumption if the outputs for each of the
tasks are appropriately rescaled, as discussed in
Section 1.2. Another interpretation is that the parameters
of the different tasks are penalized by the same set of
hyperparameters.

In an exact Bayesian procedure, one should always
integrate out the hyperparameters. In a hierarchical
Bayesian procedure, we approximate (4) through

$$P(A|\mathcal{D}) = \int d\Lambda\, P(A|\Lambda,\mathcal{D})\, P(\Lambda|\mathcal{D}) \approx P(A|\Lambda^{\text{MP}},\mathcal{D}), \quad \text{with } \Lambda^{\text{MP}} = \operatorname{argmax}_{\Lambda} P(\Lambda|\mathcal{D}).$$

The procedure is called hierarchical to indicate that
the hyperparameters are inferred at a higher level than
the model parameters. The idea behind this approximation
is that the distribution $P(\Lambda|\mathcal{D})$ is sharply
peaked around its most probable value $\Lambda^{\text{MP}}$. In our
case, where we can use the data for all $n_{\text{tasks}}$ tasks
to infer the most probable $\Lambda^{\text{MP}}$, this approximation is
extremely accurate and useful. We will simply take an
(improper) flat prior for $\Lambda$, i.e., $P(\Lambda) \propto 1$, such that
the most probable $\Lambda^{\text{MP}}$ is in fact equivalent to the set
of maximum likelihood hyperparameters $\Lambda^{\text{ML}}$.

3.2  RELEVANT  LITERATURE 

A  nice  overview  of  hierarchical  (also  called  empirical) 
Bayesian  modeling,  with  both  a  discussion  of  its  under¬ 
lying  assumptions  and  lots  of  references  to  its  applica¬ 
tions  in  statistics,  can  be  found  in  [6].  Our  approach  is 
quite  similar  in  spirit  to  the  use  of  empirical  Bayesian 
techniques  in  law  school  validity  studies,  described  and 
discussed in [9]. James-Stein estimation can be viewed
as the frequentists' equivalent of hierarchical Bayesian
modeling. A nice link is provided in [4].

In the neural-network community, hierarchical Bayes
is often referred to as the evidence framework [8]. The
focus is on learning a single task, where the prior
distribution of the weights (usually a diagonal matrix $\Sigma$
and $m$ equal to zero) is chosen to reflect the belief
that weights should be small. This yields the Bayesian
justification for weight decay or ridge regression.
Although from a technical point of view our analysis is
at some points quite similar, the meaning of the prior
distribution is different: our choice of priors has nothing
to do with an a priori assumption of small weights,
only with exchangeability under a Gaussian probability
model.


3.3  INFERENCE  OF  THE 
HYPERPARAMETERS 

To find the most probable set of hyperparameters $\Lambda^{MP}$, we
have to maximize the posterior distribution $P(\Lambda|\mathcal{D})$.
One way of doing this is through an EM algorithm (see e.g. [6]). The
multi-task situation allows for quite a lot of simplifications, which in
the end lead to update equations for $\Lambda(n)$. Here we only state
the result:

$$m(n+1) = \frac{1}{n_{\text{tasks}}} \sum_i A_i(n),$$

$$\Sigma(n+1) = \frac{1}{n_{\text{tasks}}} \sum_i \Sigma_i(n) + \frac{1}{n_{\text{tasks}}} \sum_i \left[A_i(n) - m(n+1)\right]\left[A_i(n) - m(n+1)\right]^T,$$

where $A_i(n)$ and $\Sigma_i(n)$ are the mean and variance of the
distribution $P(A|D_i, \Lambda(n))$, respectively. The second term on
the righthand side measures the variance between the most probable
solutions [given $\Lambda(n)$] for the different tasks, the first term
the variance of $P(A|D_i, \Lambda(n))$ around these most probable
solutions, averaged over all tasks. We can use Laplace's method (see
[6]), based on a quadratic Taylor expansion of
$\log P(A|D_i, \Lambda(n))$ around its mode, to find approximations for
$A_i(n)$ and $\Sigma_i(n)$:

$$A_i(n) \approx \operatorname*{argmax}_{A} \log P(A|D_i, \Lambda(n)) \quad \text{and} \quad \Sigma_i(n) \approx \left[H_i(n) + \Sigma^{-1}(n)\right]^{-1},$$

where the Hessian matrix $H_i(n)$ of the error $E(A|D_i)$ has to be
evaluated at $A_i(n)$:

$$H_i(n) = \left.\frac{\partial^2 E(A|D_i)}{\partial A\, \partial A^T}\right|_{A = A_i(n)}.$$

Laplace's method becomes more and more accurate for large sample sizes
$p$ per task.

The EM algorithm is intuitive and computationally feasible with the
approximation suggested by Laplace's method. A disadvantage of the EM
algorithm is that its convergence can be rather slow. A more direct
method can be obtained if we make a stronger assumption, namely that the
error $E(A|D_i)$ is approximately quadratic in the model parameters $A$,
i.e.,

$$E(A|D_i) \approx E(A_i^{ML}|D_i) + \tfrac{1}{2}(A - A_i^{ML})^T H_i (A - A_i^{ML}), \tag{5}$$

with $A_i^{ML}$ the maximum likelihood solution minimizing $E(A|D_i)$
and $H_i$ the Hessian evaluated at $A_i^{ML}$. This is the approximation
frequently applied in the evidence framework for neural networks (see
e.g. [8]). Now all integrations needed to compute

$$P(\Lambda|\mathcal{D}) \propto \prod_i \int dA\, P(D_i|A)\, P(A|\Lambda)$$

are over Gaussian probability distributions, yielding

$$\log P(\Lambda|\mathcal{D}) = -\tfrac{1}{2} \sum_i (A_i^{ML} - m)^T Z_i(\Lambda) (A_i^{ML} - m) + \tfrac{1}{2} \sum_i \log\left[\det Z_i(\Lambda)\right], \tag{6}$$

with $Z_i(\Lambda) = (H_i^{-1} + \Sigma)^{-1}$ and where we neglected
irrelevant additive constants. The most probable $\Lambda^{MP}$
maximizes (6) and can be found using e.g. a standard BFGS quasi-Newton
algorithm.
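As a rough illustration of the EM updates for the hyperparameters, the following is a minimal scalar sketch, not the implementation used for the experiments. It assumes a single model parameter per task and an exactly quadratic error with curvature `h[i]` around a maximum likelihood value `a_ml[i]`, so that the posterior of each task is Gaussian and Laplace's method is exact. The function name and the example inputs are hypothetical.

```python
def em_hyperparameters(a_ml, h, m=0.0, sigma2=1.0, n_iter=200):
    """Scalar EM updates for the prior mean m and prior variance sigma2.

    a_ml[i] : maximum likelihood parameter of task i (hypothetical input)
    h[i]    : curvature (scalar Hessian) of the error of task i at a_ml[i]
    """
    n_tasks = len(a_ml)
    for _ in range(n_iter):
        # E-step: posterior variance and mean of each task given (m, sigma2).
        # Posterior precision = error curvature + prior precision.
        post_var = [1.0 / (h_i + 1.0 / sigma2) for h_i in h]
        post_mean = [v * (h_i * a_i + m / sigma2)
                     for v, h_i, a_i in zip(post_var, h, a_ml)]
        # M-step: new prior mean is the average posterior mean.
        m = sum(post_mean) / n_tasks
        # New prior variance: average posterior variance plus the spread
        # of the posterior means around the new prior mean.
        spread = sum((a - m) ** 2 for a in post_mean)
        sigma2 = (sum(post_var) + spread) / n_tasks
    return m, sigma2


m, sigma2 = em_hyperparameters([0.0, 2.0, 4.0, 6.0], [1.0, 1.0, 1.0, 1.0])
print(m, sigma2)
```

In this toy setting the maximum likelihood values have mean 3 and sample variance 5, and each carries noise variance 1/h = 1, so the iteration should approach m = 3 and a prior variance of 5 - 1 = 4.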


3.4  SIMULATIONS 

In our newspaper example, the approximation (5) appeared
to be extremely accurate. $\Lambda^{MP}$ was therefore
obtained through direct optimization of (6). Given
this $\Lambda^{MP}$, we computed the most probable model
parameters $A_i^{MP}$ exactly, i.e., without making the
approximation (5). The difference between the calculation
of the maximum likelihood solutions and the
most probable solutions is that the latter are regularized
through the hyperparameters $\Lambda^{MP}$.

In the previous section we noted dramatic changes
in the relevance of groups of input variables with
increasing number of hidden units. The relevances for
the most probable solutions, shown by the circles in
Figure 3, are surprisingly constant across the different
networks with $n_{\text{features}} > 1$: given the correct
prior parameters $\Lambda^{MP}$, the most probable solutions are
roughly the same. In particular, the influence of the sellouts,
which seemed to be highly relevant according to
the maximum likelihood solutions, almost completely
vanishes.

4  DISCUSSION  AND 
CONCLUSION 

4.1  DEALING  WITH 

NONSTATIONARITY 

Until now, our analysis has been completely static.
However, the typical examples given in Section 1 are
mostly time-series prediction problems, for which the
iid assumption (2) can be too strong. Suppose that
we want to predict the output $z^{\mu}$ at "time" $\mu$ given
the inputs (we leave out the index $i$ for notational
convenience). As in the previous sections, we fit the
parameter set $\{\theta, A, \sigma\}$ on a training set containing
the most recent $p$ patterns. This is a kind of "sliding
window approach": with the addition of every new
pattern, the oldest pattern is deleted from the training
set. With a delay of $n_{\text{delay}}$ patterns between the
most recently available pattern and the output to be
predicted, the training set ends at $\mu - n_{\text{delay}}$. The
naive sliding window approach now computes the output
from the input and the scale and model parameters
fitted on this window, which in a way assumes stationarity of
the scale and model parameters over the delay.
This naive approach may work fine for many prediction
tasks, but leads to poor predictions on some of
them.

To take nonstationarity into account, we add a correction
term to the uncorrected prediction:

$$z^{\mu+n_{\text{delay}}}_{\text{corrected}} = z^{\mu+n_{\text{delay}}}_{\text{uncorrected}} + \Delta^{\mu}.$$

The parameter set used to compute the uncorrected
prediction $z_{\text{uncorrected}}$ is determined as before and we still make
the assumption that this parameter set is roughly stationary
on a time scale of a few patterns. Any nonstationarity
should be corrected through $\Delta^{\mu}$. A simple
but efficient procedure for updating $\Delta^{\mu}$ is
exponential smoothing:

$$\Delta^{\mu} = \alpha\, \delta^{\mu}_{\text{uncorrected}} + (1 - \alpha)\, \Delta^{\mu-1} = \alpha\, \delta^{\mu}_{\text{corrected}} + \Delta^{\mu-1},$$

with $\delta^{\mu}_{\text{uncorrected}} = z^{\mu} - z^{\mu}_{\text{uncorrected}}$ and
$\delta^{\mu}_{\text{corrected}} = z^{\mu} - z^{\mu}_{\text{corrected}}$ the difference between the target and
the uncorrected and corrected prediction, respectively.
$\alpha$ is a so-called smoothing parameter and $1/\alpha$
corresponds to a typical time scale. It seems reasonable
to choose the same $\alpha$ for all tasks. Furthermore, it is
well-known (see e.g. [3]) that the precise setting of the
smoothing parameter in exponential smoothing hardly
affects the prediction performance (see also Figure 4).
Perfectly stationary tasks hardly suffer from the extra
correction, since their errors $\delta_{\text{uncorrected}}$ and
$\delta_{\text{corrected}}$ tend to average out anyway.
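The correction scheme described in this section can be sketched in a few lines. The following is only an illustration under stated assumptions: the function name, argument names and default values are hypothetical, the correction term starts at zero, and a correction computed from a given pattern becomes usable `n_delay` patterns later.

```python
def corrected_predictions(targets, uncorrected, alpha=0.05, n_delay=1):
    """Add an exponentially smoothed correction term to each prediction.

    targets[mu]     : observed output at pattern mu
    uncorrected[mu] : prediction of the sliding-window model for pattern mu
    """
    delta = 0.0     # current correction term, assumed to start at zero
    history = []    # correction term after each processed pattern
    corrected = []
    for mu, (target, z_unc) in enumerate(zip(targets, uncorrected)):
        # Only a correction that is at least n_delay patterns old is available.
        available = history[mu - n_delay] if mu >= n_delay else 0.0
        corrected.append(z_unc + available)
        # Exponential smoothing of the error of the uncorrected prediction.
        err_unc = target - z_unc
        delta = alpha * err_unc + (1.0 - alpha) * delta
        history.append(delta)
    return corrected


# A model that is consistently one unit too low is gradually corrected.
print(corrected_predictions([1.0] * 4, [0.0] * 4, alpha=0.5, n_delay=1))
```

With `alpha=0` the correction stays at zero and the function returns the uncorrected predictions, which corresponds to the naive sliding window approach.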

4.2  TEST  PERFORMANCE 

Some  results  are  displayed  in  Figure  4.  All  ideas  pre¬ 
sented  in  this  paper  have  been  implemented  and  tested 
on  the  prediction  of  newspaper  sales  for  343  points  of 
sale.  The  test  set  consists  of  85  weeks  after  the  training 
set  that  has  been  used  for  computation  of  the  feature 
matrix, hyperparameters and most probable model
parameters. The model parameters are updated weekly
using the sliding window approach described above.
The hyperparameters and feature matrix have been
kept constant. The test error is minus the loglikelihood,
averaged over both patterns and points of sale.

The  network  with  two  hidden  units  appears  to  be  the 
best.  The  regularization  through  the  Bayesian  ap¬ 
proach  cannot  completely  avoid  the  risk  of  overfitting. 
On  the  other  hand,  the  best  solution  without  regular¬ 
ization  (not  shown)  is  the  one  with  one  hidden  unit, 
with  a  test  error  of  about  2.7,  increasing  rapidly  for 
more hidden units. The star shows the test performance
for a fixed choice of the feature matrix $B$, made
before the start of this project after quite a lot of
iterations of thinking, trying and testing. The solution
obtained through the multi-tasking approach is significantly
better. The righthand side shows the sensitivity
to the choice of the smoothing parameter. Taking
$\alpha = 0$ is suboptimal: at least for some points of sale,
the time series are clearly nonstationary. Any choice
of a typical smoothing time between half a year and a
year leads to about the same performance.

4.3  STATIONARITY  AND  SPECIFICITY 

A summary of the most important parameters in the
complete system is given in Table 2. At the highest
level, we have global parameters such as the number of
features $n_{\text{features}}$ and the smoothing parameter $\alpha$. The
choice of these scalar parameters is not extremely critical
(see Figure 4) and can be based either on experience
with similar databases or on testing a few different
alternatives. This is much less the case for the next
level of parameters: the input-to-hidden weights and
the hyperparameters of the prior distribution. These
parameters are typical of the task at hand. For example,
in predicting newspaper sales, they may be quite
different for different days of the week. They can be
optimized on a representative set of tasks and kept
fixed afterwards. The model and scale parameters as
well as the correction terms are obviously specific to
each task. We assume that the model parameters are
roughly stationary over the length of the training set
and can thus be determined through a sliding window
approach. The correction terms can be interpreted
as corrections to the scale parameters. These may be
much less stationary and should be updated with the
addition of every single pattern.

Figure 4: Test error (minus loglikelihood) averaged over 85 weeks and 343 points of sale. Error bars indicate the
significance of the difference with the best solution. Lefthand side: as a function of the number of hidden units
for smoothing parameter $\alpha = 0.05$. The star corresponds to the performance with a choice of three features
obtained after extensive trial and error. Righthand side: as a function of the typical smoothing time (number of
weeks) for the network with two hidden units.

4.4  IMPROVEMENTS  AND  FURTHER 
DIRECTIONS 

Let  us  recapitulate  our  approach  and  underlying  as¬ 
sumptions.  We  started  with  the  observation  that  we 
needed  some  transformation  from  the  possibly  quite 
high-dimensional  input  space  to  a  much  lower  dimen¬ 
sional  feature  space.  We  proposed  to  learn  this  trans¬ 
formation  through  a  maximum  likelihood  procedure 
on  the  weights  of  a  huge  network  containing  all  tasks. 
In  this  we  did  not  incorporate  any  prior  information, 
nor  did  we  worry  about  nonstationarity  of  the  time 
series  involved.  Keeping  the  weights  from  input  to 
hidden  units  fixed,  we  then  performed  a  hierarchical 
Bayesian  analysis  to  compute  hyperparameters,  which, 
roughly  speaking,  gave  us  the  proper  regularization  of 
the  model  parameters  specific  to  each  task.  Again, 


we  disregarded  any  of  the  nonstationarity  in  the  data. 
Finally,  keeping  both  the  hyperparameters  and  input- 
to-hidden  weights  fixed,  we  proposed  an  exponential 
smoothing  procedure  to  correct  for  nonstationarities. 

We  might  try  and  think  of  ways  to  integrate  the  parts 
of  our  approach,  instead  of  applying  them  sequentially. 
For example, it may be possible to treat the input-to-hidden
weights as hyperparameters, i.e., at the same
level as the hyperparameters $\Lambda$ for the mean and variance
of our prior distribution. The problem here is
that  it  is  much  more  difficult  to  compute  how  a  change 
in  the  hyperparameters  of  the  prior  distribution  af¬ 
fects  the  input-to-hidden  weights  than  vice  versa.  Our 
treatment  follows  from  the  assumption  that  this  ef¬ 
fect  is  negligible  for  practical  purposes.  It  is  not  easy 
to  see  how  to  go  beyond  this  simplification,  without 
having  to  rely  on  procedures  that  are  computationally 
unfeasible  for  any  reasonable  number  of  tasks. 

About  integrating  nonstationarity  and  the  Bayesian 
hierarchical  analysis,  we  may  be  somewhat  more  pos¬ 
itive.  There  has  been  some  recent  work,  which  can 
be  viewed  as  a  first  attempt  to  combine  Kalman  fil¬ 
tering  and  the  Bayesian  evidence  framework  [5].  In 
this  approach  the  hyperparameters  are  recomputed 
every  time  step.  Similar  ideas  may  be  applicable  to 
our  multi-task  situation,  although  also  here  we  have 
to  worry  about  the  computational  feasibility. 

Another improvement could be to work with a more
complicated prior for the model parameters of the different
tasks than the Gaussian considered in this paper.
One suggestion is to take another functional form,
for example, a cluster of Gaussians or a prior which
forces each task to focus on a subset of the available
features. An even more appealing approach would be
to make the prior distribution dependent on (known)
characteristics of the particular task. In our newspaper
case, the width and mean of the distribution could
be functions of the distance from the point of sale to
the beach, the population density in the vicinity of the
point of sale, and so on. The hyperparameters to be
inferred from the data would be the parameters in this
functional dependency.

Symbol                 Description               Time            Tasks     Procedure
$n_{\text{features}}$  number of hidden units    constant        same      experience/test performance
$\alpha$               smoothing parameter       constant        same      experience/test performance
$B$                    input-to-hidden weights   constant        same      multi-task learning
$\Lambda$              hyperparameters           constant        same      Bayesian inference
$\theta$               scale parameters          sliding window  specific  maximum likelihood
$A$                    model parameters          sliding window  specific  MAP estimation
$\Delta$               correction terms          single pattern  specific  exponential smoothing

Table 2: Characteristics of the most important parameters.
Abstract

In  this  paper,  we  adopt  general-sum  stochas¬ 
tic  games  as  a  framework  for  multiagent  re¬ 
inforcement  learning.  Our  work  extends  pre¬ 
vious  work  by  Littman  on  zero-sum  stochas¬ 
tic  games  to  a  broader  framework.  We  de¬ 
sign  a  multiagent  Q-learning  method  under 
this  framework,  and  prove  that  it  converges 
to  a  Nash  equilibrium  under  specified  condi¬ 
tions.  This  algorithm  is  useful  for  finding  the 
optimal  strategy  when  there  exists  a  unique 
Nash  equilibrium  in  the  game.  When  there 
exist  multiple  Nash  equilibria  in  the  game, 
this  algorithm  should  be  combined  with  other 
learning  techniques  to  find  optimal  strategies. 

1  Introduction 

Reinforcement  learning  has  gained  attention  and  ex¬ 
tensive  study  in  recent  years  [8,  15].  As  a  learning 
method  that  does  not  need  a  model  of  its  environment 
and  can  be  used  online,  reinforcement  learning  is  well- 
suited  for  multiagent  systems,  where  agents  know  lit¬ 
tle  about  other  agents,  and  the  environment  changes 
during  learning.  Applications  of  reinforcement  learn¬ 
ing  in  multiagent  systems  include  soccer  [1],  pursuit 
games  [17,  4]  and  coordination  games  [2].  In  most 
of  these  systems,  single-agent  reinforcement  learning 
methods are applied without much modification. Such
an approach treats other agents in the system as part
of the environment, ignoring the difference between
responsive agents and a passive environment. In this
paper, we propose that a multiagent reinforcement learning
method should explicitly take other agents into
account.  We  also  propose  that  a  new  framework  is 
needed  for  multiagent  reinforcement  learning. 


The  framework  we  adopt  is  stochastic  games  (also 
called  Markov  games)  [5,  18],  which  are  the  general¬ 
ization  of  the  Markov  decision  processes  to  the  case  of 
two  or  more  controllers.  Stochastic  games  are  defined 
as  non-cooperative  games,  where  agents  pursue  their 
self-interests  and  choose  their  actions  independently. 

Littman  [9]  has  introduced  2-player  zero-sum  stochas¬ 
tic  games  for  multiagent  reinforcement  learning.  In 
zero-sum  games,  one  agent’s  gain  is  always  the  other 
agent’s  loss,  thus  agents  have  strictly  opposite  in¬ 
terests.  In  this  paper,  we  adopt  the  framework  of 
general-sum  stochastic  games,  in  which  agents  need 
no  longer  have  opposite  interests.  General-sum  games 
include zero-sum games as special cases. In general-sum
games, the notion of “optimality” loses its meaning,
since each agent’s payoff depends on the other agents’
choices. The solution concept of Nash equilibrium [11] is
adopted instead. In a Nash equilibrium, each agent’s choice is
the  best  response  to  the  other  agents’  choices.  Thus, 
no  agent  can  gain  by  unilateral  deviation. 

We are interested in the Nash equilibrium solution because
we want to design learning agents for noncooperative
multiagent systems. In such systems, every agent
pursues  its  own  goal  and  there  is  no  communication 
among  agents.  A  Nash  equilibrium  is  more  plausible 
and  self-enforcing  than  any  other  solution  concept  in 
such  systems. 

If the payoff structure and state transition probabilities
are known to all the agents, we can solve for
a Nash equilibrium strategy using a nonlinear programming
method proposed by Filar and Vrieze [5].
In this paper, we are interested in situations where
agents have incomplete information about other agents’
payoff functions and the state transition probabilities.
We show that a multiagent Q-learning algorithm can
be designed, and that it converges to the Nash equilibrium
Q values under certain restrictions on the game. Our
algorithm is designed for 2-player general-sum stochastic
games, but can be extended to n-player general-sum
games.

Our  learning  algorithm  guarantees  that  an  agent  can 
learn  a  Nash  equilibrium.  But  it  does  not  say  whether 
the  other  agent  will  learn  the  same  Nash  equilibrium. 
When there exists only one Nash equilibrium in the
game, our learning algorithm works effectively. However,
a game can have multiple Nash equilibria. In that
case,  our  learning  algorithm  needs  to  be  combined  with 
empirical  estimation  of  the  action  choices  of  the  other 
agent. 

2  Some  preliminaries 

We  state  some  basic  game  theory  concepts  in  this  sec¬ 
tion.  All  concepts  here  refer  to  single-state  (static) 
games.  In  later  sections,  we  will  see  how  the  concepts 
here  are  connected  to  multi-state  stochastic  games. 

For zero-sum games, the payoff matrices of the two players
can be described as $(M, -M)$, since one player’s payoff
is always the negative of the other’s. It is sufficient to
describe the game by either $M$ or $-M$. Thus, 2-player
zero-sum games are also called matrix games. For 2-player
general-sum games, the agents’ payoff matrices
$M^1$ and $M^2$ are unrelated. The solutions of the game
depend on both $M^1$ and $M^2$. Such games are called
bimatrix games.

Definition 1 A pair of matrices $(M^1, M^2)$ constitutes
a bimatrix game $G$, where $M^1$ and $M^2$ are of the
same size. The payoff $r^k(a^1, a^2)$ to player $k$ can be
found in the corresponding entry of the matrix $M^k$,
$k = 1, 2$. The rows of $M^k$ correspond to actions of
player 1, $a^1 \in A^1$. The columns of $M^k$ correspond to
actions of player 2, $a^2 \in A^2$. $A^1$ and $A^2$ are the sets
of discrete actions of players 1 and 2 respectively.

Next, we state some solution concepts for bimatrix
games. The main concept is Nash equilibrium [12].
In a Nash equilibrium, each agent’s action is the best
response to the other agents’ choices.

Definition 2 A pure strategy Nash equilibrium for
bimatrix game $G$ is an action profile $(a^1_*, a^2_*)$ such that

$$r^1(a^1_*, a^2_*) \ge r^1(a^1, a^2_*) \quad \text{for all } a^1 \in A^1,$$
$$r^2(a^1_*, a^2_*) \ge r^2(a^1_*, a^2) \quad \text{for all } a^2 \in A^2.$$

An  example  of  a  bimatrix  game  can  be  seen  in  Figure 
1,  in  which  the  strategy  pair  (oj,  af)  constitutes  a  pure 
strategy  Nash  equilibrium. 
Figure 1: A bimatrix game example


Definition 3 A mixed strategy Nash equilibrium for
bimatrix game $G$ is a pair of vectors $(p^1_*, p^2_*)$, such that

$$p^1_* M^1 p^2_* \ge p^1 M^1 p^2_* \quad \text{for all } p^1 \in \sigma(A^1),$$
$$p^1_* M^2 p^2_* \ge p^1_* M^2 p^2 \quad \text{for all } p^2 \in \sigma(A^2),$$

where $\sigma(A^k)$ is the set of probability distributions over
action space $A^k$, such that for any $p^k \in \sigma(A^k)$,
$\sum_{a \in A^k} p^k(a) = 1$.¹

$p^1 M^1 p^2 = \sum_{a^1} \sum_{a^2} p^1(a^1)\, r^1(a^1, a^2)\, p^2(a^2)$ is the
expected payoff of agent 1 when
player 1 and player 2 adopt their mixed strategies $p^1$
and $p^2$ respectively.

The  reason  we  are  interested  in  mixed  strategies  is 
that  an  arbitrary  bimatrix  game  may  not  have  a  pure 
strategy  Nash  equilibrium,  but  it  always  has  a  mixed 
strategy  Nash  equilibrium. 
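The definitions above translate directly into a few checks on a pair of payoff matrices. The sketch below is illustrative only: the function names and the two example games (a prisoner's dilemma and matching pennies) are assumptions chosen for this illustration, not taken from Figure 1. It enumerates pure strategy Nash equilibria, computes the expected payoff of a mixed strategy pair, and tests the mixed strategy equilibrium conditions against pure deviations, which suffices because the expected payoff is linear in a player's own strategy.

```python
def expected_payoff(p1, M, p2):
    """Expected payoff sum_{a1,a2} p1(a1) * M[a1][a2] * p2(a2)."""
    return sum(p1[i] * M[i][j] * p2[j]
               for i in range(len(p1)) for j in range(len(p2)))


def pure_nash_equilibria(M1, M2):
    """All action profiles (i, j) from which neither player gains by deviating."""
    rows, cols = len(M1), len(M1[0])
    equilibria = []
    for i in range(rows):
        for j in range(cols):
            row_best = all(M1[i][j] >= M1[k][j] for k in range(rows))
            col_best = all(M2[i][j] >= M2[i][l] for l in range(cols))
            if row_best and col_best:
                equilibria.append((i, j))
    return equilibria


def is_mixed_nash(p1, p2, M1, M2, tol=1e-9):
    """Check the mixed equilibrium conditions against all pure deviations."""
    rows, cols = len(M1), len(M1[0])
    v1 = expected_payoff(p1, M1, p2)
    v2 = expected_payoff(p1, M2, p2)
    for k in range(rows):   # player 1 deviates to pure action k
        if sum(M1[k][j] * p2[j] for j in range(cols)) > v1 + tol:
            return False
    for l in range(cols):   # player 2 deviates to pure action l
        if sum(p1[i] * M2[i][l] for i in range(rows)) > v2 + tol:
            return False
    return True


# Prisoner's dilemma (action 1 = defect): one pure equilibrium.
pd1 = [[3, 0], [5, 1]]
pd2 = [[3, 5], [0, 1]]
print(pure_nash_equilibria(pd1, pd2))

# Matching pennies (zero-sum): no pure equilibrium, but a mixed one.
mp1 = [[1, -1], [-1, 1]]
mp2 = [[-1, 1], [1, -1]]
print(pure_nash_equilibria(mp1, mp2), is_mixed_nash([0.5, 0.5], [0.5, 0.5], mp1, mp2))
```

Matching pennies illustrates the point made in the text: the game has no pure strategy equilibrium, yet the uniform mixed strategies satisfy the conditions of Definition 3.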

Theorem  1  (Nash,  1951)  There  exists  a  mixed  strat¬ 
egy  Nash  equilibrium  for  any  finite  bimatrix  game. 

A mixed strategy Nash equilibrium for any bimatrix
game can be found by the Mangasarian-Stone algorithm
[10], which is a quadratic programming algorithm.

3  Markov  Decision  Process  and 
reinforcement  learning 

For comparison, we state the framework of
Markov decision processes here. Later we will see how
the stochastic game framework is related to Markov
decision processes.

Definition 4 A Markov Decision Process is a tuple
$\langle S, A, r, p \rangle$, where $S$ is the discrete state space, $A$
is the discrete action space, $r : S \times A \to R$ is the
reward function of the agent, and $p : S \times A \to \Delta$ is the
transition function, where $\Delta$ is the set of probability
distributions over state space $S$.

¹ We abuse the notation a little here. $p^1 M^1 p^2$ should be
$(p^1)^T M^1 p^2$, where $p^1$ is transposed before being multiplied
with the matrix $M^1$.


244  Hu  and  Wellman 


In a Markov decision process, the objective of the
agent is to find a strategy (policy) $\pi$ so as to maximize
the expected sum of discounted rewards,

$$v(s, \pi) = \sum_{t=0}^{\infty} \beta^t E(r_t \mid \pi, s_0 = s) \tag{1}$$

where $s_0$ is the initial state, $r_t$ is the reward at time $t$,
and $\beta \in [0, 1)$ is the discount factor. We can rewrite
Equation (1) as

$$v(s, \pi) = r(s, a_\pi) + \beta \sum_{s'} p(s'|s, a_\pi)\, v(s', \pi) \tag{2}$$

where $a_\pi$ is the action determined by policy $\pi$. It has been
proved that there exists an optimal policy $\pi^*$ such that
for any $s \in S$, the following Bellman equation holds:

$$v(s, \pi^*) = \max_a \Big\{ r(s, a) + \beta \sum_{s'} p(s'|s, a)\, v(s', \pi^*) \Big\}, \tag{3}$$

where $v(s, \pi^*)$ is called the optimal value for state $s$.
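When the reward and transition functions are known, the Bellman equation (3) can be solved by repeatedly applying its right-hand side as an update. The following value iteration sketch is one such iterative method, shown under assumptions: the function name, the data layout (`rewards[s][a]`, and `transitions[s][a]` as a dictionary from next state to probability) and the two-state example are hypothetical.

```python
def value_iteration(rewards, transitions, beta=0.9, n_sweeps=500):
    """Approximate the optimal values v(s, pi*) of a finite MDP.

    rewards[s][a]     : immediate reward r(s, a)
    transitions[s][a] : dict mapping next state s' to p(s' | s, a)
    """
    n_states = len(rewards)
    v = [0.0] * n_states
    for _ in range(n_sweeps):
        # Apply the right-hand side of the Bellman equation to every state.
        v = [max(rewards[s][a]
                 + beta * sum(p * v[s2] for s2, p in transitions[s][a].items())
                 for a in range(len(rewards[s])))
             for s in range(n_states)]
    return v


# Two states; action 0 stays, action 1 switches; only state 1 pays a reward.
rewards = [[0.0, 0.0], [1.0, 1.0]]
transitions = [[{0: 1.0}, {1: 1.0}], [{1: 1.0}, {0: 1.0}]]
print(value_iteration(rewards, transitions))
```

In this example the best policy moves to state 1 and stays there, so the values should approach 1/(1 - 0.9) = 10 for state 1 and 0.9 * 10 = 9 for state 0.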

If the agent knows the reward function and the state
transition function, it can solve for $\pi^*$ by some iterative
searching methods [13]. The learning problem
arises when the agent does not know the reward function
or the state transition probabilities. Now the
agent needs to interact with the environment to find
out its optimal policy. The agent can learn about
the reward function and the state transition function,
and then solve for its optimal policy using Equation
(3). Such an approach is called model-based reinforcement
learning. The agent can also directly learn about
its optimal policy without knowing the reward function
or the state transition function. Such an approach
is called model-free reinforcement learning. One of
the model-free reinforcement learning methods is
Q-learning [19].

The basic idea of Q-learning is that we can define the
term maximized on the right-hand side of Equation (3) as

$$Q^*(s, a) = r(s, a) + \beta \sum_{s'} p(s'|s, a)\, v(s', \pi^*) \tag{4}$$

By this definition, $Q^*(s, a)$ is the total discounted reward
attained by taking action $a$ in state $s$ and then
following the optimal policy thereafter. Then by Equation
(3),

$$v(s, \pi^*) = \max_a Q^*(s, a). \tag{5}$$

If we know $Q^*(s, a)$, then the optimal policy $\pi^*$ can
be found, which is always taking an action so as to
maximize $Q^*(s, a)$ under any state $s$.


In Q-learning, the agent starts with arbitrary initial
values of $Q(s, a)$ for all $s \in S$, $a \in A$. At each time
$t$, the agent chooses an action and observes its reward
$r_t$. The agent then updates its Q-values based on the
following equation:

$$Q_{t+1}(s, a) = (1 - \alpha_t)\, Q_t(s, a) + \alpha_t \big[ r_t + \beta \max_b Q_t(s', b) \big], \tag{6}$$

where $\alpha_t \in [0, 1)$ is the learning rate. The learning rate
$\alpha_t$ needs to decay over time in order for the learning
algorithm to converge. Watkins and Dayan [19] proved
that sequence (6) converges to the optimal $Q^*(s, a)$.
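The update in Equation (6) is a one-line operation on a table of Q-values. The sketch below is a minimal illustration, assuming a table stored as nested dictionaries `Q[state][action]`; the function name and the toy examples are hypothetical. The loop at the end uses a constant learning rate, which is enough only because the toy environment is deterministic; in general the learning rate has to decay as stated above.

```python
def q_update(Q, s, a, reward, s_next, alpha, beta):
    """Apply one Q-learning update for the transition (s, a, reward, s_next)."""
    best_next = max(Q[s_next].values())   # max_b Q(s', b)
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * (reward + beta * best_next)
    return Q[s][a]


# Single update: Q(0, 0) moves halfway toward 1 + 0.9 * max(2, 1) = 2.8.
Q = {0: {0: 0.0}, 1: {0: 2.0, 1: 1.0}}
print(q_update(Q, 0, 0, reward=1.0, s_next=1, alpha=0.5, beta=0.9))

# One state, one action, reward 1 per step, beta = 0.5: Q* = 1 / (1 - 0.5) = 2.
Q_loop = {0: {0: 0.0}}
for _ in range(100):
    q_update(Q_loop, 0, 0, reward=1.0, s_next=0, alpha=0.5, beta=0.5)
print(Q_loop[0][0])
```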

4  The  stochastic  game  framework 

A Markov decision process (MDP) is a single-agent decision
problem. A natural extension of MDPs to multiagent
systems is stochastic games, which essentially
are  n-agent  Markov  decision  processes.  In  this  paper, 
we  focus  on  2-player  stochastic  games  since  they  have 
been  well  studied. 

4.1  Definition  of  stochastic  games 

Definition 5 A 2-player stochastic game $\Gamma$ is a 6-tuple
$\langle S, A^1, A^2, r^1, r^2, p \rangle$, where $S$ is the discrete
state space, $A^k$ is the discrete action space of player
$k$ for $k = 1, 2$, $r^k : S \times A^1 \times A^2 \to R$ is the payoff
function for player $k$, $p : S \times A^1 \times A^2 \to \Delta$ is the
transition probability map, where $\Delta$ is the set of probability
distributions over state space $S$.

To have a closer look at a stochastic game, consider
a process that is observable at discrete time points
$t = 0, 1, 2, \ldots$. At each time point $t$, the state of
the process is denoted by $s_t$. Assume $s_t$ takes on
values from the set $S$. The process is controlled
by 2 decision makers, referred to as player 1 and
player 2, respectively. In state $s$, each player independently
chooses actions $a^1 \in A^1$, $a^2 \in A^2$ and receives
rewards $r^1(s, a^1, a^2)$ and $r^2(s, a^1, a^2)$, respectively.
When $r^1(s, a^1, a^2) + r^2(s, a^1, a^2) = 0$ for all
$s, a^1, a^2$, the game is called zero-sum. When the sum
is not restricted to 0 or any constant, the game is called
a general-sum game.

It is assumed that for every $s, s' \in S$, the transition
from $s$ to $s'$ given that the players take actions $a^1 \in A^1$
and $a^2 \in A^2$, is independent of time. That is, there
exist stationary transition probabilities $p(s'|s, a^1, a^2)$
for all $t = 0, 1, 2, \ldots$, satisfying the constraint

$$\sum_{s'=1}^{m} p(s'|s, a^1, a^2) = 1. \tag{7}$$

The objective of each player is to maximize a discounted
sum of rewards. Let $\beta \in [0, 1)$ be the discount
factor, and let $\pi^1$ and $\pi^2$ be the strategies of players 1 and 2
respectively. For a given initial state $s$, the two players
receive the following values from the game:

$$v^1(s, \pi^1, \pi^2) = \sum_{t=0}^{\infty} \beta^t E(r^1_t \mid \pi^1, \pi^2, s_0 = s) \tag{8}$$

$$v^2(s, \pi^1, \pi^2) = \sum_{t=0}^{\infty} \beta^t E(r^2_t \mid \pi^1, \pi^2, s_0 = s) \tag{9}$$

A strategy $\pi = (\pi_0, \ldots, \pi_t, \ldots)$ is defined over the
whole course of the game. $\pi_t$ is called the decision rule
at time $t$. A strategy $\pi$ is called a stationary strategy
if $\pi_t = \bar{\pi}$ for all $t$, i.e., the decision rule is fixed over
time. $\pi$ is called a behavior strategy if $\pi_t = f(h_t)$,
where $h_t$ is the history up to time $t$,

ht  =  (10) 

A stationary strategy is a special case of behavior
strategy when $h_t = \emptyset$.

A decision rule assigns mixed strategies to different
states. A decision rule of a stationary strategy has the
following form: $\bar{\pi} = (\bar{\pi}(s^1), \ldots, \bar{\pi}(s^m))$, where $m$ is the
number of states. $\bar{\pi}(s)$ is a mixed strategy
under state $s$.

A Nash equilibrium for stochastic games is defined as
follows, assuming that the players have complete information
about the payoff functions of both players.

Definition 6 In stochastic game $\Gamma$, a Nash equilibrium
point is a pair of strategies $(\pi^1_*, \pi^2_*)$ such that for
all $s \in S$

$$v^1(s, \pi^1_*, \pi^2_*) \ge v^1(s, \pi^1, \pi^2_*) \quad \forall \pi^1 \in \Pi^1$$

and

$$v^2(s, \pi^1_*, \pi^2_*) \ge v^2(s, \pi^1_*, \pi^2) \quad \forall \pi^2 \in \Pi^2$$

Figure 2: Stochastic games and bimatrix games

The definition of Nash equilibrium requires that each agent's strategy is a best response to the other's strategy. This definition is similar to that used in other games. The strategies that constitute a Nash equilibrium can be behavior strategies, Markov strategies, or stationary strategies. In this paper, we are interested in stationary strategies, which are the simplest. The following theorem shows that there always exists a Nash equilibrium in stationary strategies for any stochastic game.

Theorem 2 (Filar and Vrieze [5], Theorem 4.6.4)
Every  general-sum  discounted  stochastic  game  pos¬ 
sesses  at  least  one  equilibrium  point  in  stationary 
strategies. 

4.2  Stochastic  games  and  bimatrix  games 

We  can  view  each  stage  of  a  stochastic  game  as  a  bi¬ 
matrix  game,  as  in  Figure  2. 

At each time period of a stochastic game, under state $s$, agents 1 and 2 choose their actions independently and receive their payoffs according to the bimatrix game $(r^1(s), r^2(s))$. Repeated games can be seen as a degenerate case of stochastic games in which there is only one state. For example, letting $s$ be the index of the only state, a repeated game always has the bimatrix game $(r^1(s), r^2(s))$ at each time period.

5  Multiagent  reinforcement  learning 

We want to extend the traditional reinforcement learning method, which is based on Markov decision processes, to stochastic games. We assume that our games have incomplete but perfect information, meaning that agents do not know other agents' payoff functions but can observe other agents' immediate payoffs and previously taken actions.

5.1  Issues  in  designing  a  multiagent 
Q-learning  algorithm 

The target of our Q-learning is the optimal Q-values, which we define as follows:

$$Q_*^1(s,a^1,a^2) = r^1(s,a^1,a^2) + \beta \sum_{s'=1}^{N} p(s'|s,a^1,a^2)\, v^1(s',\pi_*^1,\pi_*^2) \tag{11}$$

$$Q_*^2(s,a^1,a^2) = r^2(s,a^1,a^2) + \beta \sum_{s'=1}^{N} p(s'|s,a^1,a^2)\, v^2(s',\pi_*^1,\pi_*^2) \tag{12}$$

The optimal Q-value of state $s$ and action pair $(a^1,a^2)$ is the total discounted reward received by an agent when both agents execute actions $(a^1,a^2)$ in state $s$ and follow their Nash equilibrium strategies $(\pi_*^1,\pi_*^2)$ thereafter.

To learn these Q-values, an agent needs to maintain $m$ Q-tables for its own Q-values, where $m$ is the total number of states. For each agent $k$, $k = 1,2$, a Q-table $Q^k(s)$ has its rows corresponding to $a^1 \in A^1$, its columns corresponding to $a^2 \in A^2$, and each entry given by $Q^k(s,a^1,a^2)$. The total number of entries agent $k$ needs to learn is $m \times |A^1| \times |A^2|$, where $|A^1|$ and $|A^2|$ are the sizes of action spaces $A^1$ and $A^2$. Assuming $|A^1| = |A^2| = |A|$, the space requirement is $m \times |A|^2$. For $n$ agents, the space requirement is $m \times |A|^n$, which is exponential in the number of agents. Thus, for a large number of agents, we need to find some compact representation of the action space.
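To make the growth concrete, the count $m \times |A|^n$ can be evaluated directly. The helper name `q_table_entries` and the example sizes below are illustrative assumptions.

```python
def q_table_entries(num_states, num_actions, num_agents):
    """Entries one agent must learn: m * |A|**n (equal action-set sizes assumed)."""
    return num_states * num_actions ** num_agents


# With 10 states and 4 actions per agent, going from 2 to 5 agents
# multiplies the table size by 4**3 = 64.
two_agents = q_table_entries(10, 4, 2)
five_agents = q_table_entries(10, 4, 5)
```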

As in single-agent Q-learning, the learning agent in multiagent systems updates its Q-tables for a given state after it observes the state, the actions taken by both agents, and the rewards received by the agents. The difference is in the updating rule. In single-agent Q-learning, the Q-values are updated as follows:

$$Q_{t+1}(s,a) = (1 - \alpha_t) Q_t(s,a) + \alpha_t \left[ r_t + \beta \max_{b} Q_t(s',b) \right].$$
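The single-agent rule can be sketched in a few lines. This is a minimal illustration assuming a tabular `Q[s][a]` stored as nested lists; the function name and argument order are hypothetical.

```python
def q_update(Q, s, a, r, s_next, alpha, beta):
    """One single-agent Q-learning update, applied in place to Q[s][a]."""
    # The agent backs up the best value it can obtain on its own in s_next.
    best_next = max(Q[s_next])
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + beta * best_next)
    return Q[s][a]
```

The `max` over the agent's own actions is the step that has no direct counterpart when the payoff also depends on another agent's action.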

In  multiagent  Q-learning,  we  cannot  just  maximize  our 
own  Q-values  since  the  Q-values  depend  on  the  action 
of  the  other  agent. 

If it is a zero-sum game, we can minimize over the other agent's actions and then choose our own maximum. This is the minimax-Q learning algorithm of Littman [9]. For general-sum games, we cannot use the minimax algorithm because the two agents' payoffs are not the opposite of each other. We propose that an agent adopt a Nash strategy to update its Q-values, and this is the best an agent can do in a general-sum game.

Figure 3: Time line of actions

5.2  A  multiagent  Q-learning  algorithm 

Our Q-learning agent, say agent 1, updates its Q-values according to the following rule:

$$Q_{t+1}^1(s,a^1,a^2) = (1 - \alpha_t) Q_t^1(s,a^1,a^2) + \alpha_t \left[ r_t^1 + \beta\, \pi^1(s') Q_t^1(s') \pi^2(s') \right] \tag{13}$$

where $(\pi^1(s'),\pi^2(s'))$ is a mixed strategy Nash equilibrium for the bimatrix game $(Q_t^1(s'), Q_t^2(s'))$. In order to find $\pi^2(s')$, agent 1 needs to learn about $Q_t^2(s')$ in the game. The learning is as follows:

$$Q_{t+1}^2(s,a^1,a^2) = (1 - \alpha_t) Q_t^2(s,a^1,a^2) + \alpha_t \left[ r_t^2 + \beta\, \pi^1(s') Q_t^2(s') \pi^2(s') \right] \tag{14}$$
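Updates (13) and (14) can be sketched as follows, assuming the equilibrium strategies for the next state have already been computed by some Nash solver and are passed in. The names `bilinear` and `nash_q_update` and the `Q[s][a1][a2]` nested-list layout are illustrative assumptions.

```python
def bilinear(pi1, Q, pi2):
    """Expected payoff pi1 * Q * pi2 for a payoff matrix Q[a1][a2]."""
    return sum(pi1[i] * Q[i][j] * pi2[j]
               for i in range(len(pi1))
               for j in range(len(pi2)))


def nash_q_update(Q1, Q2, s, a1, a2, r1, r2, s_next, pi1, pi2, alpha, beta):
    """Apply updates (13) and (14) in place.

    pi1, pi2 are assumed to be equilibrium mixed strategies of the
    bimatrix game (Q1[s_next], Q2[s_next]), supplied by the caller.
    """
    # Evaluate both equilibrium payoffs before either table is modified.
    nash1 = bilinear(pi1, Q1[s_next], pi2)
    nash2 = bilinear(pi1, Q2[s_next], pi2)
    Q1[s][a1][a2] = (1 - alpha) * Q1[s][a1][a2] + alpha * (r1 + beta * nash1)
    Q2[s][a1][a2] = (1 - alpha) * Q2[s][a1][a2] + alpha * (r2 + beta * nash2)
```

Only the entry for the observed triple $(s, a^1, a^2)$ changes; every other entry of both tables keeps its previous value.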

Therefore,  a  learning  agent  maintains  two  Q-tables 
for  each  state,  one  for  its  own  Q-values  and  one  for 
the  other  agent’s.  This  is  possible  since  we  assume  an 
agent  can  observe  the  other  agent’s  immediate  rewards 
and  previous  actions  during  learning. 

The details of our Q-learning algorithm are given in Table 1.

When the game is zero-sum, $Q^1(s,a^1,a^2) = -Q^2(s,a^1,a^2) = Q(s,a^1,a^2)$. Thus agent 1 needs to learn only one Q-table for every state. Our Q-learning algorithm becomes

$$Q_{t+1}(s,a^1,a^2) = (1 - \alpha_t) Q_t(s,a^1,a^2) + \alpha_t \left[ r_t + \beta \max_{\pi^1(s')} \min_{\pi^2(s')} \pi^1(s') Q_t(s') \pi^2(s') \right]$$

This is different from Littman's minimax-Q learning algorithm, where the Q-value is updated as

$$Q_{t+1}(s,a^1,a^2) = (1 - \alpha_t) Q_t(s,a^1,a^2) + \alpha_t \left[ r_t + \beta \max_{\pi^1(s')} \min_{a^2} \pi^1(s') Q_t(s',a^2) \right]$$

In Littman's Q-learning algorithm, it is assumed that the other agent will always choose a pure Nash equilibrium strategy instead of a mixed strategy.

Table 1: Multiagent Q-learning algorithm for Agent 1

Initialize:
    Let $t = 0$.
    For all $s$ in $S$, $a^1$ in $A^1$, and $a^2$ in $A^2$, let $Q_t^1(s,a^1,a^2) = 1$ and $Q_t^2(s,a^1,a^2) = 1$.
    Initialize $s_0$.
Loop
    Choose action $a_t^1$ based on $\pi^1(s_t)$, which is a mixed strategy Nash equilibrium solution of the bimatrix game $(Q_t^1(s_t), Q_t^2(s_t))$.
    Observe $r_t^1$, $r_t^2$, $a_t^2$, and $s_{t+1}$.
    Update $Q_t^1$ and $Q_t^2$ such that
        $Q_{t+1}^1(s,a^1,a^2) = (1 - \alpha_t) Q_t^1(s,a^1,a^2) + \alpha_t [r_t^1 + \beta \pi^1(s_{t+1}) Q_t^1(s_{t+1}) \pi^2(s_{t+1})]$
        $Q_{t+1}^2(s,a^1,a^2) = (1 - \alpha_t) Q_t^2(s,a^1,a^2) + \alpha_t [r_t^2 + \beta \pi^1(s_{t+1}) Q_t^2(s_{t+1}) \pi^2(s_{t+1})]$
    where $(\pi^1(s_{t+1}), \pi^2(s_{t+1}))$ are mixed strategy Nash solutions of the bimatrix game $(Q_t^1(s_{t+1}), Q_t^2(s_{t+1}))$.
    Let $t := t + 1$.
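The Nash step in Table 1 requires an equilibrium of a bimatrix game. As an illustration only, the two-action case can be solved in closed form: check the four pure profiles, and otherwise mix so that the opponent is indifferent. The function `nash_2x2` is a hypothetical helper, not the solver used by the authors; it returns the first equilibrium it finds and does not address equilibrium selection.

```python
def nash_2x2(A, B):
    """Return one Nash equilibrium (p, q) of a 2x2 bimatrix game.

    A[i][j], B[i][j]: payoffs to the row and column player.
    p, q: probability vectors over rows and columns.
    """
    # Pure equilibria: each action is a best response to the other.
    for i in range(2):
        for j in range(2):
            if A[i][j] >= A[1 - i][j] and B[i][j] >= B[i][1 - j]:
                p = [1.0 if k == i else 0.0 for k in range(2)]
                q = [1.0 if k == j else 0.0 for k in range(2)]
                return p, q
    # No pure equilibrium: best responses cycle, so both denominators are
    # nonzero and each player mixes to make the opponent indifferent.
    q0 = (A[1][1] - A[0][1]) / (A[0][0] - A[0][1] - A[1][0] + A[1][1])
    p0 = (B[1][1] - B[1][0]) / (B[0][0] - B[0][1] - B[1][0] + B[1][1])
    return [p0, 1 - p0], [q0, 1 - q0]
```

For larger action sets a general bimatrix solver would be needed in place of this closed form.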

Another  thing  to  note  is  that  in  our  Q-learning  algo¬ 
rithm,  how  an  agent  chooses  its  action  at  each  time  t 
is  not  important  for  the  convergence  of  the  learning. 
But  the  action  choices  are  important  for  short-term 
performance.  In  this  paper,  we  have  not  studied  the 
issue  of  action  choice,  but  will  explore  it  in  our  future 
work. 

5.3  Convergence  of  our  algorithm 

In  this  section,  we  prove  the  convergence  of  our  Q- 
learning  algorithm  under  certain  assumptions.  The 
first  two  assumptions  are  standard  ones  in  Q-learning: 

Assumption 1 Every state and action are visited infinitely often.

Assumption 2 The learning rate $\alpha_t$ satisfies the following conditions:

1.  0  <  at  < 

2. $\alpha_t(s,a^1,a^2) = 0$ if $(s,a^1,a^2) \ne (s_t,a_t^1,a_t^2)$.

We  make  further  assumptions  regarding  the  structure 
of  the  game: 


Assumption 3 A Nash equilibrium $(\pi^1(s),\pi^2(s))$ for any bimatrix game $(Q^1(s),Q^2(s))$ satisfies one of the following properties:

1. The Nash equilibrium is globally optimal:

$\pi^1(s) Q^k(s) \pi^2(s) \ge \hat{\pi}^1(s) Q^k(s) \hat{\pi}^2(s) \quad \forall \hat{\pi}^1(s) \in \sigma(A^1),\ \hat{\pi}^2(s) \in \sigma(A^2)$, and $k = 1,2$.

2. If the Nash equilibrium is not globally optimal, then an agent receives a higher payoff when the other agent deviates from the Nash equilibrium strategy:

$\pi^1(s) Q^1(s) \pi^2(s) \le \pi^1(s) Q^1(s) \hat{\pi}^2(s) \quad \forall \hat{\pi}^2(s) \in \sigma(A^2)$, and

$\pi^1(s) Q^2(s) \pi^2(s) \le \hat{\pi}^1(s) Q^2(s) \pi^2(s) \quad \forall \hat{\pi}^1(s) \in \sigma(A^1)$.

Our convergence proof is based on the following two lemmas proved by Szepesvári and Littman [16].

Lemma 1 (Conditional Average Lemma) Under Assumptions 1-2, the process $Q_{t+1} = (1 - \alpha_t) Q_t + \alpha_t w_t$ converges to $E(w_t \mid h_t, \alpha_t)$, where $h_t$ is the history at time $t$.
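The averaging behavior in Lemma 1 can be seen in a small deterministic sketch. The step size $1/(t+2)$ used here is an assumed choice for illustration, as is the function name `averaged_iterate`; with this step size the iterate after $n$ samples equals $(q_0 + \sum_i w_i)/(n+1)$.

```python
def averaged_iterate(samples, q0=0.0):
    """Run Q_{t+1} = (1 - alpha_t) Q_t + alpha_t w_t with alpha_t = 1 / (t + 2)."""
    q = q0
    for t, w in enumerate(samples):
        alpha = 1.0 / (t + 2)  # assumed decaying step size, strictly below 1
        q = (1 - alpha) * q + alpha * w
    return q


# Samples alternating between 1 and 3 have mean 2; the iterate approaches it.
estimate = averaged_iterate([1.0, 3.0] * 500)
```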

Lemma 2 Under Assumptions 1-2, if the process defined by $U_{t+1}(x) = (1 - \alpha_t(x)) U_t(x) + \alpha_t(x) [P_t v^*](x)$ converges to $v^*$ and $P_t$ satisfies $\| P_t V - P_t v^* \| \le \gamma \| V - v^* \| + \lambda_t$ for all $V$, where $0 < \gamma < 1$ and $\lambda_t \ge 0$ converges to 0, then the iteration defined by

$$V_{t+1}(x) = (1 - \alpha_t(x)) V_t(x) + \alpha_t(x) [P_t V_t](x)$$

converges to $v^*$.

In order to prove that the convergence point of our Q-learning algorithm is actually the Nash equilibrium point, we need the following theorem proved by Filar and Vrieze [5].

Theorem 3 (Filar and Vrieze [5]) The following assertions are equivalent:

1. For each $s \in S$, the pair $(\pi^1(s),\pi^2(s))$ constitutes an equilibrium point in the static bimatrix game $(Q^1(s),Q^2(s))$ with equilibrium payoffs $(v^1(s,\pi^1,\pi^2), v^2(s,\pi^1,\pi^2))$, and for $k = 1,2$ the entry $(a^1,a^2)$ in $Q^k(s)$ equals

$$Q^k(s,a^1,a^2) = r^k(s,a^1,a^2) + \beta \sum_{s'=1}^{N} p(s'|s,a^1,a^2)\, v^k(s',\pi^1,\pi^2).$$

2. $(\pi^1,\pi^2)$ is an equilibrium point in the discounted stochastic game $\Gamma$ with equilibrium payoff $(v^1(\pi^1,\pi^2), v^2(\pi^1,\pi^2))$, where $v^k(\pi^1,\pi^2) = (v^k(s^1,\pi^1,\pi^2), \ldots, v^k(s^m,\pi^1,\pi^2))$, $k = 1,2$.

The above theorem shows that the Nash solution of the bimatrix game $(Q^1(s), Q^2(s))$ defined in Theorem 3 is also part of the Nash solution for the whole game. If the sequence in our Q-learning algorithm converges to the Q-values defined in Theorem 3, then a pair of stationary Nash equilibrium strategies $(\bar{\pi}^1, \bar{\pi}^2)$ can be derived, where $\bar{\pi}^k = (\bar{\pi}^k(s^1), \ldots, \bar{\pi}^k(s^m))$ for $k = 1, 2$. For each state $s$, $\bar{\pi}^k(s)$ is part of a Nash equilibrium solution of the bimatrix game $(Q^1(s),Q^2(s))$.

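Lemma 3 Let $P_t^k Q^k(s) = r_t^k + \beta \pi^1(s) Q^k(s) \pi^2(s)$, $k = 1,2$, where $(\pi^1(s),\pi^2(s))$ is a pair of mixed Nash equilibrium strategies for the bimatrix game $(Q^1(s),Q^2(s))$. Then $P_t = (P_t^1,P_t^2)$ is a contraction mapping.

Proof Throughout, $(\hat{\pi}^1(s),\hat{\pi}^2(s))$ denotes a pair of mixed Nash equilibrium strategies for the bimatrix game $(\hat{Q}^1(s),\hat{Q}^2(s))$.

Case 1: $P_t^k Q^k(s) \ge P_t^k \hat{Q}^k(s)$ for $k = 1, 2$.

For $k = 1$, we have

$$\begin{aligned}
0 &\le P_t^1 Q^1(s) - P_t^1 \hat{Q}^1(s) \\
&= \beta \left( \pi^1(s) Q^1(s) \pi^2(s) - \hat{\pi}^1(s) \hat{Q}^1(s) \hat{\pi}^2(s) \right) \\
&\le \beta \left( \pi^1(s) Q^1(s) \pi^2(s) - \pi^1(s) \hat{Q}^1(s) \hat{\pi}^2(s) \right) \quad (15) \\
&\le \beta \left( \pi^1(s) Q^1(s) \hat{\pi}^2(s) - \pi^1(s) \hat{Q}^1(s) \hat{\pi}^2(s) \right) \quad (16) \\
&= \beta \sum_{a^1} \sum_{a^2} \pi^1(s,a^1) \hat{\pi}^2(s,a^2) \left( Q^1(s,a^1,a^2) - \hat{Q}^1(s,a^1,a^2) \right) \quad (17) \\
&\le \beta \sum_{a^1} \sum_{a^2} \pi^1(s,a^1) \hat{\pi}^2(s,a^2) \, \| Q^1(s) - \hat{Q}^1(s) \| \\
&= \beta \, \| Q^1(s) - \hat{Q}^1(s) \|,
\end{aligned}$$

where $\| Q^k(s) - \hat{Q}^k(s) \| = \max_{a^1,a^2} |Q^k(s,a^1,a^2) - \hat{Q}^k(s,a^1,a^2)|$. Inequality (15) derives from the definition of Nash equilibrium. Inequality (16) is from property 2 of Assumption 3. For cases satisfying property 1 of Assumption 3, the proof is simpler, and we omit it here.

For $k = 2$, the proof is similar. Under property 1 of Assumption 3, we have

$$\begin{aligned}
0 &\le P_t^2 Q^2(s) - P_t^2 \hat{Q}^2(s) \\
&\le \beta \sum_{a^1} \sum_{a^2} \pi^1(s,a^1) \pi^2(s,a^2) \, \| Q^2(s) - \hat{Q}^2(s) \| \\
&= \beta \, \| Q^2(s) - \hat{Q}^2(s) \|.
\end{aligned}$$

Under property 2 of Assumption 3, we have

$$\begin{aligned}
0 &\le P_t^2 Q^2(s) - P_t^2 \hat{Q}^2(s) \\
&\le \beta \sum_{a^1} \sum_{a^2} \hat{\pi}^1(s,a^1) \pi^2(s,a^2) \, \| Q^2(s) - \hat{Q}^2(s) \| \\
&= \beta \, \| Q^2(s) - \hat{Q}^2(s) \|.
\end{aligned}$$

Case 2: $P_t^k Q^k(s) \le P_t^k \hat{Q}^k(s)$. The proof is similar to Case 1. For $k = 1$, under property 2 of Assumption 3, we have

$$\begin{aligned}
0 &\le P_t^1 \hat{Q}^1(s) - P_t^1 Q^1(s) \\
&\le \beta \sum_{a^1} \sum_{a^2} \hat{\pi}^1(s,a^1) \pi^2(s,a^2) \, \| Q^1(s) - \hat{Q}^1(s) \| \\
&= \beta \, \| Q^1(s) - \hat{Q}^1(s) \|.
\end{aligned}$$

Therefore we have $|P_t^k Q^k(s) - P_t^k \hat{Q}^k(s)| \le \beta \| Q^k(s) - \hat{Q}^k(s) \|$. Since this holds for every state $s$, we have $\| P_t^k Q^k - P_t^k \hat{Q}^k \| \le \beta \| Q^k - \hat{Q}^k \|$. □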

Now we proceed to prove our main theorem, which states that the multiagent Q-learning method converges to the "optimal" (Nash equilibrium) Q-values.

Theorem 4 In stochastic game $\Gamma$, under Assumptions 1-3, the coupled sequences $\{Q_t^1, Q_t^2\}$, updated by

$$Q_{t+1}^k(s,a^1,a^2) = (1 - \alpha_t) Q_t^k(s,a^1,a^2) + \alpha_t \left[ r_t^k + \beta\, \pi^1(s') Q_t^k(s') \pi^2(s') \right] \tag{18}$$

where $k = 1,2$, converge to the Nash equilibrium Q-values $(Q_*^1, Q_*^2)$, with $Q_*^k$ defined as

$$Q_*^k(s,a^1,a^2) = r^k(s,a^1,a^2) + \beta \sum_{s'} p(s'|s,a^1,a^2)\, v^k(s',\pi_*^1,\pi_*^2), \tag{19}$$

where $(\pi^1(s'),\pi^2(s'))$ is a pair of mixed Nash equilibrium strategies for the bimatrix game $(Q_t^1(s'), Q_t^2(s'))$, the function $v^k$ is defined as in (8) and (9), and $(\pi_*^1,\pi_*^2)$ is a Nash equilibrium solution for stochastic game $\Gamma$.

Proof By Lemma 3, $\| P_t^k Q^k - P_t^k \hat{Q}^k \| \le \beta \| Q^k - \hat{Q}^k \|$.

From Lemma 1, the sequence

$$Q_{t+1}^k(s,a^1,a^2) = (1 - \alpha_t) Q_t^k(s,a^1,a^2) + \alpha_t \left[ r_t^k + \beta\, \pi^1(s') Q^k(s') \pi^2(s') \right]$$

converges to

$$E\left( r_t^k + \beta\, \pi^1 Q^k(s') \pi^2 \right) = \sum_{s'} p(s'|s,a^1,a^2) \left( r^k(s,a^1,a^2) + \beta\, \pi^1(s') Q^k(s') \pi^2(s') \right).$$

Define $T^k$ as

$$(T^k Q^k)(s,a^1,a^2) = \sum_{s'} p(s'|s,a^1,a^2) \left( r^k(s,a^1,a^2) + \beta\, \pi^1(s') Q^k(s') \pi^2(s') \right).$$

From the above, the sequence $\{Q_t^k\}$ converges to $T^k Q^k$.

It is easy to show that $T^k$ is a contraction mapping. To see this, rewrite $T^k$ as $T^k Q^k(s) = \sum_{s'} p(s'|s,a^1,a^2) P_t^k Q^k(s')$. Since $P_t^k$ is a contraction mapping of $Q^k$ and $p(s'|s,a^1,a^2) \ge 0$, $T^k$ is also a contraction mapping of $Q^k$. We proceed to show that $Q_*^k$ defined in (19) is the fixed point of $T^k$. From the definition of $T^k$, we have

$$\begin{aligned}
(T^k Q_*^k)(s,a^1,a^2) &= \sum_{s'} p(s'|s,a^1,a^2) \left( r^k(s,a^1,a^2) + \beta\, \pi_*^1(s') Q_*^k(s') \pi_*^2(s') \right) \\
&= r^k(s,a^1,a^2) + \sum_{s'} p(s'|s,a^1,a^2)\, \beta\, \pi_*^1(s') Q_*^k(s') \pi_*^2(s').
\end{aligned}$$

By Theorem 3, $\pi_*^1(s') Q_*^k(s') \pi_*^2(s') = v^k(s',\pi_*^1,\pi_*^2)$, thus $Q_*^k = T^k Q_*^k$. Therefore the sequence

$$Q_{t+1}^k(s,a^1,a^2) = (1 - \alpha_t) Q_t^k(s,a^1,a^2) + \alpha_t \left[ r_t^k + \beta\, \pi_*^1 Q_*^k(s') \pi_*^2 \right] \tag{20}$$

converges to $T^k Q_*^k = Q_*^k$. By Lemma 2, the sequence (18) converges to $Q_*^k$. □

5.4  Discussions 

First, we want to point out that the convergence result does not depend on the sequence of actions taken by either agent. It only requires that every action is tried and every state is visited infinitely often (Assumption 1).

It does not require that agent 1 and agent 2 agree on the Nash equilibrium of each bimatrix Q-game during learning. In fact, agent 1 can learn its optimal Q-values without any assumption about the behavior of agent 2, as long as agent 1 can observe agent 2's immediate rewards.

Second, the convergence depends on certain restrictions on the bimatrix games encountered during learning (Assumption 3). These restrictions are required because the Nash equilibrium operator is usually not a contraction operator. However, we can probably relax them by proving that a Nash equilibrium operator is a non-expansion operator. Then, by the theorem in Szepesvári and Littman [16], convergence would be guaranteed.

6  Future  work 

There  are  several  issues  we  have  not  addressed  in  this 
paper.  The  first  is  the  equilibrium  selection  problem. 


When  there  exist  multiple  Nash  equilibria,  learning 
one  Nash  equilibrium  strategy  does  not  guarantee  the 
other  agent  will  choose  the  same  Nash  equilibrium. 
Our  future  work  is  to  combine  empirical  estimation  of 
the  other  agent’s  strategy  with  reinforcement  learning 
of  the  Nash  equilibrium  strategy. 

Another issue is related to the action choice during learning. Even though the multiagent reinforcement learning method converges, it requires infinitely many trials. During learning, an agent can choose a myopic action or other kinds of actions. If the agent chooses the action that maximizes its current Q-value, its approach is called greedy. The drawback of the greedy approach is that the agent may be trapped in a local optimum. To avoid this problem, the agent should explore other possible actions. However, there is a cost associated with exploration: by exploring, an agent gives up a better current reward. In our future work, we intend to design an algorithm that can handle the exploration-exploitation tradeoff in stochastic games.

Abstract

Coevolutionary  learning,  which  involves  the 
embedding  of  adaptive  learning  agents  in 
a  fitness  environment  that  dynamically  re¬ 
sponds  to  their  progress,  is  a  potential  so¬ 
lution  for  many  technological  chicken  and 
egg  problems.  However,  several  impediments 
have  to  be  overcome  in  order  for  coevolution¬ 
ary  learning  to  achieve  continuous  progress 
in  the  long  term.  This  paper  presents  some 
of  those  problems  and  proposes  a  framework 
to  address  them.  This  presentation  is  illus¬ 
trated  with  a  case  study:  the  evolution  of 
CA  rules.  Our  application  of  coevolution¬ 
ary  learning  resulted  in  a  very  significant  im¬ 
provement  for  that  problem  compared  to  the 
best  known  results. 


1  Introduction 

A  recurrent  issue  in  the  field  of  machine  learning  is  that 
the  performance  of  a  learning  system  relies  heavily  on 
the  amount  of  knowledge  that  has  been  introduced  by 
the  designer.  This  knowledge  can  be  expressed  in  the 
form  of  an  appropriate  representation,  specific  search 
operators,  a  training  set  which  provides  a  good  gradi¬ 
ent  or  a  special  utility  function.  The  success  of  most 
learning  systems  actually  results  from  all  this  engineer¬ 
ing  effort. 

However, the goal of machine learning is a system that can improve itself by continuously capturing and exploiting new knowledge. The framework presented in this paper to achieve such a goal is based on a coevolutionary approach. An important factor in the performance of learning systems is the design of a training environment. Usually, this training environment is fixed and constructed by the human designer. However, when little knowledge is available about the problem, or if this knowledge is difficult to introduce into the training environment, learning can become intractable. The approach proposed in this paper to get around that problem consists of coevolving the training environment with a population of learners. Starting with simple problems, the training environment gets more challenging as learners improve. Ideally, such a setup leads to continuous progress. For the rest of the paper, we define coevolutionary learning as a search procedure involving a population of learners coevolving with a population of problems such that continuous progress results from this interaction.

In  practice,  the  picture  is  not  that  simple.  We  will  dis¬ 
cuss  the  different  issues  that  are  involved  to  achieve  co¬ 
evolutionary  learning  by  considering  a  particular  prob¬ 
lem:  the  discovery  of  cellular  automata  rules  to  im¬ 
plement  a  classification  task.  This  problem  presents 
some  interesting  properties  that  provide  us  with  a  sim¬ 
ple  framework  to  monitor  the  dynamics  of  the  search 
resulting  from  different  setups.  Section  2  describes 
this  problem.  In  section  3,  an  experimental  analysis 
presents  the  different  impediments  to  coevolutionary 
learning  and  a  solution  to  address  them  is  proposed  in 
section  4.  Experimental  results  for  the  classification 
problem  are  presented  in  section  5. 

2  Description  of  the  Problem 

2.1  One-Dimensional  Cellular  Automata 

A one-dimensional cellular automaton (CA) is a linear wrap-around array composed of $N$ cells in which each cell can take one out of $k$ possible states. A rule is defined for each cell in order to update its state. This rule determines the next state of a cell given its current state and the state of cells in a predefined neighborhood. For the model discussed in this paper, this neighborhood is composed of cells whose distance is at most $r$ from the central cell. This operation is performed synchronously for all the cells in the CA. From now on, we will consider that the state of cells is binary ($k = 2$), $N = 149$ and $r = 3$. This means that the size of the rule space is $2^{2^{2r+1}} = 2^{128}$.

Figure 1: Three space-time diagrams describing the evolution of CA states: in the first two, the CA relaxes to the correct uniform pattern while in the third one it doesn't converge at all to a fixed point.

Table 1: Performance of different published CA rules and a new best rule for the $\rho_c = 1/2$ task. Columns: $N = 149$, $599$, $999$. Rows: Coevolution, Das rule, ABK rule, GKL rule.
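The synchronous update described above can be sketched as follows. The encoding of a rule as a lookup table indexed by the neighborhood read as a binary number (leftmost neighbor most significant) is an assumption made for illustration, as are the names `ca_step` and `local_majority_rule`; the local majority rule is only a simple example of a point in the rule space, not one of the rules studied in the paper.

```python
def ca_step(cells, rule, r=3):
    """One synchronous update of a binary wrap-around CA.

    cells: list of 0/1 values of length N
    rule:  list of 2**(2*r+1) output bits, indexed by the neighborhood
           read as a binary number (assumed encoding)
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        index = 0
        for offset in range(-r, r + 1):
            # The modulo implements the wrap-around boundary.
            index = (index << 1) | cells[(i + offset) % n]
        new_cells.append(rule[index])
    return new_cells


def local_majority_rule(r=3):
    """Rule table that outputs the majority bit of the 2r+1 neighborhood."""
    size = 2 * r + 1
    return [1 if bin(index).count("1") > r else 0 for index in range(2 ** size)]
```

With $r = 3$ the table has $2^7 = 128$ entries, which is why a rule can be stored as a 128-bit string and the rule space has $2^{128}$ elements.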

Cellular  automata  have  been  studied  widely  as  they 
represent  one  of  the  simplest  systems  in  which  complex 
emergent  behaviors  can  be  observed.  This  model  is 
very  attractive  as  a  means  to  study  complex  systems 
in  nature.  Indeed,  the  evolution  of  such  systems  is 
ruled  by  simple,  locally-interacting  components  which 
result  in  the  emergence  of  global,  coordinated  activity. 

2.2  The  Majority  Function 

This is a density classification task, for which one wants the state of the cells of the CA to relax to all 0's or all 1's depending on the density of the initial configuration (IC) (whether it has more 0's or more 1's), within a maximum of $M$ time steps. Following [Mitchell et al., 1994], $\rho_c$ denotes the threshold for the classification task (here, $\rho_c = 1/2$), $\rho$ denotes the density of 1's in a configuration and $\rho_0$ denotes the density of 1's in the initial configuration. Figure 1 presents three examples of the space-time evolution of a CA: one with $\rho_0 < \rho_c$ on the left and another with $\rho_0 > \rho_c$ in the middle, for which the CA relaxes to the correct configuration. The third one shows an instance for which the CA doesn't relax to either of the two desired convergence patterns. For each diagram, the initial configuration is at the top and the evolution in time of the state of the CA is represented downward.

The task $\rho_c = 1/2$ is known to be difficult. In particular, it has been proven that no rule exists that results in the CA relaxing to the correct state for all possible ICs [Land & Belew, 1995]. Indeed, the density is a global property of the initial configuration while individual cells of the CA have access to local information only. Discovering a rule that will display the appropriate computation by the CA with the highest accuracy is a challenge, and the upper limit for this accuracy is still unknown. Table 1 describes the performance for that task for different published rules and different values of $N$, along with the performance of the new best rule that resulted from the work presented in this paper. The Gacs-Kurdyumov-Levin (GKL) rule was designed in 1978 for a different goal than solving the $\rho_c = 1/2$ task [Mitchell et al., 1994]. However, for a while it provided the best known performance. [Mitchell et al., 1994] and [Das et al., 1994] used Genetic Algorithms (GAs) to explore the space of rules. The main purpose of this work was to develop a particle-based methodology for the analysis of the complex behaviors exhibited by CAs. The GKL and Das rules are human-written while the Andre-Bennett-Koza (ABK) rule was discovered using the Genetic Programming paradigm [Andre et al., 1996]. More recently, [Paredis, 1997] describes a coevolutionary approach to search the space of rules and shows the difficulty of coevolving two populations consistently towards continuous improvement. [Capcarrere et al., 1996] also reports that by changing the specification of the convergence pattern, a two-state, $r = 1$ CA exists that can perfectly solve the density problem in $\lceil N/2 \rceil$ time steps.

For the $\rho_c = 1/2$ task, it is believed that the best rules are in the domain of the rule space with density close to 0.5. An intuitive argument to support this hypothesis is presented in [Mitchell et al., 1993]. It is also believed that the most difficult ICs are those with density close to 0.5.

3  Models  for  Coevolutionary  Search 

The  idea  of  using  coevolution  in  search  was  introduced 
by  [Hillis,  1992].  In  coevolution,  individuals  are  eval¬ 
uated  with  respect  to  other  individuals  instead  of  a 
fixed  environment  (or  landscape).  As  a  result,  agents 
adapt  in  response  to  other  agents’  behavior.  The  par¬ 
ticular  model  of  coevolution  considered  in  this  paper 
is  based  on  two  populations  for  which  the  fitness  of 
individuals  in  each  population  is  defined  with  respect 
to  the  members  of  the  other  population.  Two  cases 
can  be  considered  in  such  a  framework,  depending  on 
whether  the  two  populations  benefit  from  each  other 
or  whether  they  have  different  interests.  Those  two 
modes of interaction are called cooperative and competitive, respectively. In the following sections, those modes of interaction are described experimentally using the $\rho_c = 1/2$ task in order to stress the different issues related to coevolutionary learning.

For the experiments presented in this section, we used an implementation of Genetic Algorithms similar to the one described in [Mitchell et al., 1994]. Each rule is coded on a binary string of length $2^{2r+1} = 128$. One-point crossover is used with a 2% bit mutation probability. The population size is $n_R = 200$ for rules and $n_{IC} = 200$ for ICs. The population of ICs is composed of binary strings of length $N = 149$. The populations of rules and ICs are initialized according to a uniform distribution over $[0.0, 1.0]$ for the density. For all the experiments in this paper, the value of $M$ (the maximum number of time steps) is set to 320 and is kept unchanged. At each generation, the top 95% of each population reproduces to the next generation and the remaining 5% is the result of crossover between parents from the top 95% selected using a fitness proportionate rule. This small generation gap (the percentage of new individuals) has been used because of the dynamic fitness landscape. Indeed, a large generation gap can result in a dramatic change in the composition of the population. As a consequence, because of the relative definition of the fitness, a lot of variation in individuals' fitness can occur from one generation to the next, making the identification of the most promising individuals very unreliable.
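One generation of the scheme described above might look like the following sketch. Several details are assumptions made for illustration: the function name `next_generation`, the representation of individuals as lists of bits, the decision to mutate only the newly created children, and the fallback to uniform selection when all fitness values are zero.

```python
import random


def next_generation(population, fitness, keep=0.95, mutation=0.02, rng=random):
    """Keep the top fraction of individuals and fill the rest with children."""
    ranked = sorted(zip(fitness, population), key=lambda pair: pair[0], reverse=True)
    n_keep = int(round(keep * len(population)))
    survivors = [individual for _, individual in ranked[:n_keep]]
    weights = [fit for fit, _ in ranked[:n_keep]]
    if sum(weights) <= 0:
        weights = [1.0] * n_keep  # assumed fallback: uniform selection
    children = []
    while len(survivors) + len(children) < len(population):
        # Fitness-proportionate choice of two parents among the survivors.
        mother, father = rng.choices(survivors, weights=weights, k=2)
        cut = rng.randrange(1, len(mother))  # one-point crossover
        child = mother[:cut] + father[cut:]
        child = [bit ^ 1 if rng.random() < mutation else bit for bit in child]
        children.append(child)
    return survivors + children
```

With a population of 200 and `keep=0.95`, only 10 individuals are replaced per generation, which corresponds to the small generation gap motivated in the text.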

3.1  Cooperation  between  Populations 

In  this  mode  of  interaction,  improvement  on  one  side 
results  in  positive  feedback  on  the  other  side.  As 
a  result,  there  is  a  reinforcement  of  the  relationship 
between  the  two  populations.  From  a  search  point 
of  view,  this  can  be  seen  as  an  exploitative  strategy. 
Agents  are  not  encouraged  to  explore  new  areas  of  the 
search  space  but  only  to  perform  local  search  in  order 
to  further  improve  the  strength  of  the  relationship.  In 
the cooperative model, a natural definition for the fitness of rules (resp. ICs) is the number of ICs (resp. rules) for which the CA relaxes to the correct state:

$$f(R_i) = \sum_{j=1}^{n_{IC}} covered(R_i, IC_j)$$

$$f(IC_j) = \sum_{i=1}^{n_R} covered(R_i, IC_j)$$

where $covered(R_i, IC_j)$ returns 1 if a CA using rule $R_i$ and starting from initial configuration $IC_j$ relaxes to the correct state. Otherwise, it returns 0.
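Given a precomputed table of outcomes, the two cooperative fitness sums are simply row and column totals. The sketch below assumes a 0/1 matrix `covered[i][j]` for rule $i$ and IC $j$; the function name is hypothetical.

```python
def cooperative_fitness(covered):
    """covered[i][j] is 1 if rule i classifies IC j correctly, else 0."""
    rule_fitness = [sum(row) for row in covered]
    ic_fitness = [sum(column) for column in zip(*covered)]
    return rule_fitness, ic_fitness
```

Both populations are rewarded by the same events, so an IC that every rule solves receives the highest possible fitness.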

Figure 2 presents the evolution of the density of rules and ICs for one run using this cooperative model. Unsurprisingly, the populations of rules and ICs quickly converge to a domain of the search space where ICs are easy for rules and rules consistently solve ICs. As a result, there is little exploration of the search space. The convergence configuration depends on the initial populations; some other runs ended up with low density rules and ICs. This experiment confirms that ICs with low or high density are the easiest to classify since a larger number of rules classify them correctly.

Figure 2: Coevolution of CA rules (left) and ICs (right) in a cooperative relationship.

3.2  Competition  between  Populations 

In  this  mode  of  interaction,  the  two  populations  are  in 
conflict.  Improvement  on  one  side  results  in  negative 
feedback  for  the  other  population.  The  fitness  of  rules 
and  ICs  defined  in  the  cooperative  case  can  be  modified 
as follows to implement the competitive model:

$$f(R_i) = \sum_{j=1}^{n_{IC}} \mathrm{covered}(R_i, IC_j)$$

$$f(IC_j) = \sum_{i=1}^{n_R} \overline{\mathrm{covered}}(R_i, IC_j)$$

where $\overline{\mathrm{covered}}(R_i, IC_j)$ returns the inverse of the original function. Here, the goal of rules is to defeat (i.e., cover) ICs, while the goal of ICs is to defeat rules by discovering initial configurations that are difficult to classify. Figure 3 describes a run using this definition of the fitness. Two kinds of behavior can be observed in this experiment. In a first stage, the two populations exhibit a cyclic behavior. It is a consequence of the Red Queen effect [Cliff & Miller, 1995]: fitness landscapes are changing as a result of agents of each population adapting in response to the evolution of members of the other population. The evaluation of individuals’ performance in this changing environment makes continuous progress difficult. A typical consequence is that agents have to learn again what they already knew in the past. In the context of evolutionary search, this means that domains of the state space that have already been explored in the past are searched again. Then, a stable state is reached: in this case, the population of rules adapts faster than the population of ICs, resulting in a population focusing only on rules with high density and eliminating all instances of low density rules (a finite population is considered). Then, low density ICs exploit those rules and overcome the entire population. A similar experiment is described in [Paredis, 1997].
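The competitive fitness can be illustrated with a minimal sketch. It assumes the outcome of every rule against every IC is already available as a boolean matrix; the name `competitive_fitness` and the matrix layout are illustrative choices, not part of the original experiments.

```python
def competitive_fitness(covered):
    """Competitive fitness from a coverage matrix (illustrative sketch).

    covered[i][j] is True when rule i classifies IC j correctly.
    A rule scores one point per IC it covers; an IC scores one point
    per rule that fails to cover it (the inverse of `covered`).
    """
    n_rules = len(covered)
    n_ics = len(covered[0]) if covered else 0
    rule_fitness = [
        sum(1 for j in range(n_ics) if covered[i][j]) for i in range(n_rules)
    ]
    ic_fitness = [
        sum(1 for i in range(n_rules) if not covered[i][j]) for j in range(n_ics)
    ]
    return rule_fitness, ic_fitness


# Hypothetical outcomes for 2 rules against 3 ICs.
covered = [
    [True, True, False],
    [True, False, False],
]
rule_fitness, ic_fitness = competitive_fitness(covered)
```

With these hypothetical outcomes, the IC that no rule covers receives the highest fitness, which is the pressure that drives ICs toward configurations that are hard to classify.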


3.3  Resource  Sharing  and  Mediocre  Stable 
States 

Several techniques have been designed to improve evolutionary search. Usually they maintain diversity in the population in order to avoid premature convergence. [Mahfoud, 1995] presents different niching techniques that achieve this goal. Resource sharing, first introduced in [Rosin & Belew, 1995], is a technique that we successfully used in the past [Juille & Pollack, 1996]. Resource sharing implements a coverage-based heuristic by giving a higher payoff to problems that few individuals can solve. Resource sharing can be introduced in the competitive model of coevolution as follows:

$$f(R_i) = \sum_{j=1}^{n_{IC}} \mathit{weight\_IC}_j \times \mathrm{covered}(R_i, IC_j)$$

where:

$$\mathit{weight\_IC}_j = \frac{1}{\sum_{k=1}^{n_R} \mathrm{covered}(R_k, IC_j)}$$

and

$$f(IC_j) = \sum_{i=1}^{n_R} \mathit{weight\_R}_i \times \overline{\mathrm{covered}}(R_i, IC_j)$$

where:

$$\mathit{weight\_R}_i = \frac{1}{\sum_{k=1}^{n_{IC}} \overline{\mathrm{covered}}(R_i, IC_k)}$$

Figure 3: Coevolution of CA rules (left) and ICs (right) in a competitive relationship.

In this definition, the weight of an IC corresponds to the payoff it returns if a rule covers it. If few rules cover an IC, this weight will be much larger than if a lot of rules cover that same IC. The definition for the weight of rules has the same purpose. This framework allows the presence of multiple niches (or species) in populations. Figure 4 describes one run for this definition of the fitness. The cyclic behavior which was observed in the previous section doesn’t occur anymore. Instead, two species coexist in the population of rules: a species for low density rules and another one for high density rules. Those two species drive the evolution of ICs towards the domain of initial configurations that are most difficult to classify (i.e., $\rho_0 = 1/2$). However, the two populations have entered a mediocre stable state. This means that multiple average performance niches coexist in both populations in a stable manner. Put another way, this can be seen as an equilibrium configuration in which a number of suboptimal species have found a way to collude by sharing the total credit between themselves. Usually, this is a consequence of some singularities inherent in the problem definition and/or the search procedure. In our example, ICs are concentrated around the $\rho_0 = 1/2$ threshold and they can be divided into two groups: those with density $\rho_0 < 1/2$ and those with density $\rho_0 > 1/2$. This distribution means that ICs can be exploited consistently by rules with low and high density that both occur in the second population (because a CA implementing a low (resp. high) density rule usually relaxes to all 0’s (resp. all 1’s) for most ICs). However, this is a mediocre stable state in the sense that evolved rules have poor performance with respect to the $\rho_c = 1/2$ task and there is no pressure towards improvement. The concept of mediocre stable states is also discussed in [Pollack et al., 1996].
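The resource sharing weights can be sketched as follows. This is only an illustration under stated assumptions: outcomes are given as a boolean matrix, the function name `shared_fitness` is hypothetical, and a weight is set to zero when its denominator is empty (a case the definitions above leave open, and one that never contributes to any sum).

```python
def shared_fitness(covered):
    """Competitive fitness with resource sharing (illustrative sketch).

    covered[i][j] is True when rule i classifies IC j correctly.
    """
    n_rules, n_ics = len(covered), len(covered[0])
    # Number of rules covering each IC, and number of ICs defeating each rule.
    cover_count = [sum(1 for i in range(n_rules) if covered[i][j]) for j in range(n_ics)]
    defeat_count = [sum(1 for j in range(n_ics) if not covered[i][j]) for i in range(n_rules)]
    # Assumption: an empty denominator gives weight 0.
    weight_ic = [1.0 / c if c else 0.0 for c in cover_count]
    weight_rule = [1.0 / d if d else 0.0 for d in defeat_count]
    rule_fitness = [
        sum(weight_ic[j] for j in range(n_ics) if covered[i][j]) for i in range(n_rules)
    ]
    ic_fitness = [
        sum(weight_rule[i] for i in range(n_rules) if not covered[i][j]) for j in range(n_ics)
    ]
    return rule_fitness, ic_fitness


# Hypothetical outcomes for 2 rules against 3 ICs.
covered = [
    [True, True, False],
    [True, False, False],
]
rule_fitness, ic_fitness = shared_fitness(covered)
```

In this example the first IC is covered by both rules, so each rule receives only half of its payoff, while the second IC pays its full unit to the single rule that covers it. The total payoff distributed among rules equals the number of ICs covered by at least one rule, which is why several niches can share the credit in a stable manner.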


3.4  Discussion 

We  have  described  different  models  for  the  coevolu¬ 
tion  of  two  populations.  Some  of  the  fundamental  im¬ 
pediments  to  coevolutionary  learning  have  been  iden¬ 
tified  along  with  some  of  the  reasons  why  continuous 
progress  is  difficult  to  achieve.  It  is  now  clear  that  none 
of  these  approaches  can  address  successfully  the  prob¬ 
lem  of  coevolutionary  learning  alone.  All  the  rules  dis¬ 
covered  in  those  experiments  perform  poorly  since  they 
never  approach  the  50%  density.  The  following  section 
proposes  a  framework  to  get  around  those  problems. 

Each of the canonical models discussed so far implements a single specific strategy. In the literature, there have been some successful applications of both the cooperative and the competitive approaches. However, those works usually introduce some mechanisms to address the problems specific to each model. For instance, a noisy evaluation of the fitness can force exploration in a cooperative model, and an evaluation of individuals with respect to a set of opponents extracted from previous generations can limit the cyclic behavior observed in competitive models (e.g., see the life-time fitness evaluation technique described in [Paredis, 1996] or the “hall of fame” method presented in [Rosin, 1997]). However, those mechanisms usually fail to address entirely the fundamental issues discussed previously.

Figure 4: Coevolution of CA rules (left) and ICs (right) in a competitive relationship with resource sharing.

4  Coevolving  the  “Ideal”  Trainer 

4.1  Presentation  of  our  Approach 

From the analysis of the experiments presented in section 3, at least two reasons seem to prevent continuous progress in coevolutionary search. The first one is that the training environment provided by the population of ICs returns little information to the population of evolving rules because a stable configuration is reached in which the credit is distributed according to a fixed pattern (e.g., all the ICs are covered by rules). The second reason is that the dynamics of the search performed by the two coevolving populations doesn’t drive individuals to the domain of the state space that contains the most promising solutions because there is no “high-level” strategy to play that role. This is a consequence of the Red Queen effect.

Our  approach  proposes  a  coevolutionary  framework  in 
which  those  two  issues  are  addressed  as  follows: 

•  the  training  environment  provides  at  any  time  a 
gradient  for  search  by  proposing  a  variety  of  prob¬ 
lems  covering  a  range  of  difficulty.  Indeed,  if  prob¬ 
lems  are  too  difficult,  nobody  can  solve  them.  On 
the  contrary,  if  they  are  too  easy,  everybody  can 
solve  them.  In  both  cases,  those  problems  are  use¬ 
less  for  learning  since  they  provide  little  feedback. 

•  a “high-level” strategy allows continuous progress by preventing the negative effects associated with the Red Queen.

The central idea of this coevolutionary learning approach consists in exposing learners to problems that are just beyond those they know how to solve. By maintaining this constant pressure towards slightly more difficult problems, an arms race among learners is induced such that learners that adapt better have an evolutionary advantage. The underlying heuristic implemented by this arms race is that adaptability is the driving force for improvement. The difficulty resides in the accurate implementation of the concepts presented above in a search algorithm. So far, our methodology to implement such a system consists in the construction of an explicit topology over the space of problems by defining a partial order with respect to the relative difficulty of problems among each other. In our current work, the concept of “relative difficulty” has been defined by exploiting some a priori knowledge about the task. The definition of this topology over the space of problems makes possible the implementation of the two goals required in our coevolutionary learning approach. Indeed, since learners are evaluated against a known range of difficulty for problems, it is possible to monitor their progress and to expose them to problems that are just “a little more difficult”. In our work, this last concept has been formalized by defining empirically a distance measure. In this framework, learners are always exposed to a gradient for search and it is possible to control the evolution of the training environment towards more difficult problems in order to ensure continuous progress.

In the future, our goal is to eliminate some of those explicit components by introducing some heuristics that automatically identify problems that are appropriate for the current set of learners. The work of Rosin [Rosin, 1997] already describes some methods to address this issue.


4.2  Discussion 

As stated previously, the coevolutionary learning framework introduces a pressure towards adaptability. The central assumption is that individuals that adapt faster than others in order to solve the new challenges they are exposed to are also more likely to solve even more difficult problems. The main difficulty is to set up a coevolutionary framework that implements this heuristic accurately and efficiently.

The  new  contribution  of  this  work  is  the  idea  of  main¬ 
taining  a  gradient  for  search  as  one  of  the  underlying 
heuristics.  In  the  literature,  different  approaches  have 
been  proposed  to  address  the  issues  associated  with  the 
Red  Queen  effect  [Paredis,  1996,  Rosin,  1997].  How¬ 
ever,  to  our  knowledge,  explicit  methods  to  force 
progress  and  to  prevent  mediocre  stable  states  in  the 
context  of  evolutionary  search  have  never  been  tried. 

The  idea  of  introducing  a  pressure  towards  adapt¬ 
ability  as  the  central  heuristic  for  search  is  not  new. 
Schmidhuber  [Schmidhuber,  1995]  proposed  the  Incre¬ 
mental  Self-Improvement  system  in  which  adaptabil¬ 
ity  is  the  measure  that  is  optimized.  The  concept 
of  an  ideal  trainer  is  also  discussed  in  [Epstein,  1994] 
in  the  context  of  game  learning.  However,  this  work 
addresses  the  issue  of  designing  the  “ideal”  training 
procedure  which  would  result  in  high  quality  players 
rather  than  coevolving  the  training  environment  in  re¬ 
sponse  to  the  progress  of  learners. 


5  Application  to  the  Discovery  of  CA 
Rules 

5.1  Experimental  Setup 

The approach described in the previous section is applied to the $\rho_c = 1/2$ task. It is believed that ICs become more and more difficult to classify correctly as their density gets closer to the $\rho_c$ threshold. Therefore, our idea is to construct a framework that adapts the distribution of the density for the population of ICs as CA rules get better at solving the task. The following definition for the fitness of rules and ICs has been used to achieve this goal.

$$f(R_i) = \sum_{j=1}^{n_{IC}} \mathit{weight\_IC}_j \times \mathrm{covered}(R_i, IC_j)$$

where:

$$\mathit{weight\_IC}_j = \frac{1}{\sum_{k=1}^{n_R} \mathrm{covered}(R_k, IC_j)}$$

and

$$f(IC_j) = \sum_{i=1}^{n_R} \mathit{weight\_R}'_i \times E(R_i, \rho(IC_j)) \times \overline{\mathrm{covered}}(R_i, IC_j)$$

where:

$$\mathit{weight\_R}'_i = \frac{1}{\sum_{k=1}^{n_{IC}} E(R_i, \rho(IC_k)) \times \overline{\mathrm{covered}}(R_i, IC_k)}$$


This definition implements the competitive relationship with resource sharing. However, a new component, namely $E(R_i, \rho(IC_j))$, has been added in the definition of the ICs’ fitness. The purpose of this new component is to penalize ICs with density $\rho(IC_j)$ if little information is collected with respect to the rule $R_i$. Indeed, we consider that if a rule $R_i$ has a 50% classification accuracy over ICs with density $\rho(IC_j)$ then this is equivalent to random guessing and no payoff should be returned to $IC_j$. On the contrary, if the performance of $R_i$ is significantly better or worse than the 50% threshold for a given density of ICs this means that $R_i$ captured some relevant properties to deal with those ICs. Once again, the idea is that the training environment should be composed of ICs that provide useful information to distinguish good rules from poor ones. In order to allow continuous progress, our implementation exploits an intrinsic property of the $\rho_c = 1/2$ task. Indeed, it seems that CA rules that cover ICs with density $\rho_0 < 1/2$ (resp. $\rho_0 > 1/2$) with high performance will also be very successful over ICs with density $\rho'_0 < \rho_0$ (resp. $\rho'_0 > \rho_0$). Therefore, as ICs become more difficult, their density approaches $\rho_0 = 1/2$ but rules don’t have to be tested against easier ICs. Following this idea, we defined $E()$ as the complement of the entropy of the outcome between a rule and ICs with a given density:

$$E(R_i, \rho(IC_j)) = \log(2) + p \log(p) + q \log(q)$$

where p is the probability that an IC with density $\rho(IC_j)$ defeats the rule $R_i$ and $q = 1 - p$. $E()$ implements the distance measure discussed in section 4.1. Its purpose is to maintain the balance between the search for more difficult ICs and ICs that can be solved by rules. In practice, the entropy is evaluated by performing some statistics over the population of ICs.
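As a small worked illustration of the entropy-based component, the sketch below evaluates E for a given defeat probability p. The function name `information_payoff`, the use of natural logarithms, and the convention 0 log 0 = 0 are assumptions made for this example; how p is estimated from the population of ICs is not reproduced here.

```python
import math


def information_payoff(p):
    """Complement of the binary entropy: log(2) + p*log(p) + q*log(q).

    p is the probability that an IC of a given density defeats the rule,
    and q = 1 - p. Terms with probability 0 are skipped (0 * log 0 = 0).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    total = math.log(2.0)
    for prob in (p, 1.0 - p):
        if prob > 0.0:
            total += prob * math.log(prob)
    return total


# A rule that is right half of the time carries no information ...
payoff_random = information_payoff(0.5)
# ... while a rule that is always right (or always wrong) carries the most.
payoff_certain = information_payoff(0.0)
```

The value is zero at p = 1/2 and reaches its maximum, log(2), at p = 0 and p = 1, which matches the stated intent that ICs on which a rule behaves like random guessing return no payoff.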
Table 2: Description of the current best rule and published rules for the $\rho_c = 1/2$ task.

Coevolution   00010100 01011111 01000000 00000000 00010111 11111100 00000010 00010111
              00010100 01011111 00000011 00001111 00010111 11111111 11111111 11010111
Das rule      00000111 00000000 00000111 11111111 00001111 00000000 00001111 11111111
              00001111 00000000 00000111 11111111 00001111 00110001 00001111 11111111
ABK rule      00000101 00000000 01010101 00000101 00000101 00000000 01010101 00000101
              01010101 11111111 01010101 11111111 01010101 11111111 01010101 11111111
GKL rule      00000000 01011111 00000000 01011111 00000000 01011111 00000000 01011111
              00000000 01011111 11111111 01011111 00000000 01011111 11111111 01011111

5.2  Experimental  Results 

Experiments were performed with different sizes for the population of rules and ICs. The best rule whose performance is reported in Table 1 resulted from the experiments that used the largest population size. In those experiments, 6 runs were performed for 5,000 generations, using a size of 1,000 for the two populations. Each rule is coded on a binary string of length $2^{2 \times 3 + 1} = 128$. One-point crossover is used with a 2% bit mutation probability. The population of rules is initialized according to a uniform distribution over [0.0, 1.0] for the density. Each individual in the population of ICs represents a density $\rho_0 \in [0.0, 1.0]$. This population is also initialized according to a uniform distribution over $\rho_0 \in [0.0, 1.0]$. At each generation, each member generates a new instance for an initial configuration with respect to the density it represents. All rules are evaluated against this new set of ICs. The generation gap is 5% for the population of ICs (i.e., the top 95% ICs reproduce to the next generation). There is neither crossover nor mutation. The new 5% of ICs are the result of a random sampling over $\rho_0 \in [0.0, 1.0]$ according to a uniform probability distribution. The generation gap is 80% for the population of rules. New rules are created by crossover and mutation. Parents are randomly selected from the top 20%. All runs consistently evolved some rules that score above 82%. Table 2 describes lookup tables for the current best CA rule and other rules discussed in the literature. The leftmost bit corresponds to the result of the rule on input 0000000, the second bit corresponds to input 0000001, ... and the rightmost bit corresponds to input 1111111.

Figure 5 describes the evolution of the density of rules and ICs for one run. As rules improve, their density gets closer to 1/2 and the density of ICs is distributed on two peaks on each side of $\rho_c = 1/2$. In that particular run, it is only after 1,300 generations that a significant improvement is observed for rules and that, in response, the population of ICs adapts dramatically in order to propose more challenging initial configurations. This shows that our strategy to coevolve the training environment and the learners has been successfully implemented in the definition of the fitness functions.
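The bit ordering of the lookup tables can be made concrete with a short sketch that applies one update step of a rule to a lattice. The 7-bit inputs imply a neighborhood of radius 3; the circular (wrap-around) lattice, the function name `step`, and the choice of the GKL table from Table 2 as the example are assumptions of this illustration.

```python
# GKL lookup table as listed in Table 2 (128 bits, leftmost bit = input 0000000).
GKL_RULE = (
    "00000000" "01011111" "00000000" "01011111"
    "00000000" "01011111" "00000000" "01011111"
    "00000000" "01011111" "11111111" "01011111"
    "00000000" "01011111" "11111111" "01011111"
)

RADIUS = 3  # 2 * 3 + 1 = 7 cells per neighborhood


def step(config, rule=GKL_RULE, radius=RADIUS):
    """Apply one synchronous update to a circular lattice of 0/1 cells."""
    n = len(config)
    new_config = []
    for x in range(n):
        # Read the neighborhood left to right as a binary number,
        # so the leftmost neighbor is the most significant bit.
        index = 0
        for offset in range(-radius, radius + 1):
            index = (index << 1) | config[(x + offset) % n]
        new_config.append(int(rule[index]))
    return new_config


all_zeros = step([0] * 9)
all_ones = step([1] * 9)
```

Under this indexing the homogeneous configurations are fixed points of the GKL rule, which is the behavior a density classifier needs once it has relaxed to its answer.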

6  Conclusion

This paper presents a new framework based on the concept of coevolutionary learning. This approach coevolves the training environment with respect to a population of learners such that learners are always exposed to a gradient for search, and evolution of problems towards increasing difficulty is maintained. The work presented in this paper addresses those issues by defining a topology over the space of problems. Then, a procedure is implemented such that the training environment automatically adapts in response to the progress of learners by proposing more challenging problems. We applied this framework to the problem of evolving CA rules for a classification task. Our experiments resulted in a new rule whose performance improves very significantly over previously known rules for that particular task.
Abstract

We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we then give algorithms requiring a number of actions and total computation time that are only polynomial in T and the number of states, for both the undiscounted and discounted cases.

An interesting aspect of our algorithms is their explicit handling of the Exploration-Exploitation trade-off.

1  Introduction 

In reinforcement learning, an agent interacts with an unknown environment, and attempts to choose actions that maximize its cumulative payoff (Sutton & Barto, 1998; Barto et al., 1990; Bertsekas & Tsitsiklis, 1990). The environment is typically modeled as a Markov decision process (MDP), and it is assumed that the agent does not know the parameters of this process, but has to learn how to act directly from experience. Thus, the reinforcement learning agent faces a fundamental trade-off between exploitation and exploration (Thrun, 1992; Sutton & Barto, 1998): should the agent exploit its cumulative experience so far, by executing the action that currently seems best, or should it execute a different action, with the hope of gaining information or experience that could lead to higher future payoffs? Too little exploration can prevent the agent from ever converging to the optimal behavior, while too much exploration can prevent the agent from gaining near-optimal payoff in a timely fashion.

There is a large literature on reinforcement learning, which has been growing rapidly in the last decade. To the best of our knowledge, all previous results on reinforcement learning in general MDP's are asymptotic in nature, providing no explicit guarantees on either the number of actions or the computation time the agent requires to achieve near-optimal performance (Sutton, 1988; Watkins & Dayan, 1992; Jaakkola et al., 1991; Tsitsiklis, 1991; Gullapalli & Barto, 1991). On the other hand, finite-time results become available if one considers restricted classes of MDP's, if the model of learning is modified from the standard one, or if one changes the criteria for success (Saul & Singh, 1990; Fiechter, 1991; Fiechter, 1997; Schapire & Warmuth, 1991; Singh & Dayan, in press). Fiechter (1991, 1997), whose results are closest in spirit to ours, considers only the discounted case, and makes the learning protocol easier by assuming the availability of a “reset” button that allows the agent to return to a fixed set of start states at any time.

Thus, despite the many interesting previous results in reinforcement learning, the literature has lacked algorithms for learning optimal behavior in general MDP's with provably finite bounds on the resources (actions and computation time) required, under the standard model of learning in which the agent wanders continuously in the unknown environment. The results presented in this paper fill this void in what is essentially the strongest possible sense.

We present new algorithms for reinforcement learning, and prove that they have polynomial bounds on the resources required to achieve near-optimal payoff in general MDP's. The bounds are polynomial in the number of states, and also in the mixing time of the optimal policy (undiscounted case), or the horizon time $1/(1 - \gamma)$ (discounted case). One of the contributions of this work is in simply identifying the fact that finite-time convergence results must depend on these parameters of the underlying MDP. An interesting aspect of our algorithms is their rather explicit handling of the exploration-exploitation trade-off.

For  lack  of  space,  here  we  present  only  our  re¬ 
sults  for  the  more  difficult  undiscounted  case.  The 
analogous  results  for  the  discounted  case  are  cov¬ 
ered  in  a  forthcoming  longer  paper;  interested  read¬ 
ers  can  retrieve  the  latest  version  from  the  web  page 
http://www.research.att.com/~mkearns. 

2  Preliminaries  and  Definitions 

We  begin  with  the  basic  definitions  for  MDP’s. 

Definition 1 A Markov decision process (MDP) M on states $1, \ldots, N$ and with actions $a_1, \ldots, a_k$, consists of:

Transition probabilities $P^a_M(ij) \geq 0$, which for any action a, and any states i and j, specify the probability of reaching state j after executing action a from state i in M. Thus, $\sum_j P^a_M(ij) = 1$ for any state i and action a.

Payoff distributions, for each state i, with mean $R_M(i)$ (where $R_{\max} \geq R_M(i) \geq 0$), and variance $\mathrm{Var}_M(i) \leq \mathrm{Var}_{\max}$. These distributions determine the random payoff received when state i is visited.

For simplicity, we will assume that the number of actions k is a constant; it will be easily verified that if k is a parameter, the resources required by our algorithms scale polynomially with k.
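The conditions in Definition 1 can be checked mechanically. The sketch below is illustrative only: it assumes states are numbered from 0, that transition probabilities are stored as nested lists indexed by action, then source state, then target state, and that only mean payoffs are represented; the name `validate_mdp` is hypothetical.

```python
def validate_mdp(transitions, mean_payoffs, r_max, tol=1e-9):
    """Check the constraints of Definition 1 on a tabular MDP.

    transitions[a][i][j] is the probability of reaching state j after
    executing action a in state i; mean_payoffs[i] is the mean payoff
    of state i. Raises ValueError if a constraint is violated.
    """
    n = len(mean_payoffs)
    for a, matrix in enumerate(transitions):
        if len(matrix) != n:
            raise ValueError(f"action {a}: expected {n} rows")
        for i, row in enumerate(matrix):
            if len(row) != n or any(prob < 0.0 for prob in row):
                raise ValueError(f"action {a}, state {i}: invalid row")
            if abs(sum(row) - 1.0) > tol:
                raise ValueError(f"action {a}, state {i}: row does not sum to 1")
    for i, payoff in enumerate(mean_payoffs):
        if not 0.0 <= payoff <= r_max:
            raise ValueError(f"state {i}: mean payoff outside [0, r_max]")
    return True


# A hypothetical two-state, one-action MDP.
transitions = [[[0.9, 0.1], [0.0, 1.0]]]
ok = validate_mdp(transitions, mean_payoffs=[0.0, 5.0], r_max=5.0)
```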

Several  comments  regarding  some  benign  technical  as¬ 
sumptions  that  we  will  make  on  payoffs  are  in  order 
here.  First,  it  is  common  to  assume  that  payoffs  are 
actually  associated  with  state-action  pairs,  rather  than 
with  states  alone.  Our  choice  of  the  latter  is  entirely 
for  technical  simplicity,  and  all  of  the  results  of  this 
paper  hold  for  the  standard  state-action  payoffs  model 
as  well.  Second,  we  have  assumed  fixed  upper  bounds 
Rmax  and  Varmax  on  the  means  and  variances  of  the 
payoff  distributions;  such  a  restriction  is  necessary  for 
finite-time  convergence  results.  Third,  we  have  as¬ 
sumed  that  expected  payoffs  are  always  non-negative 
for  convenience,  but  this  is  easily  removed  by  adding 
the  minimum  expected  payoff  to  every  payoff. 


If M is an MDP over states $1, \ldots, N$ and with actions $a_1, \ldots, a_k$, a policy in M is a mapping $\pi : \{1, \ldots, N\} \to \{a_1, \ldots, a_k\}$. An MDP M, combined with a policy $\pi$, yields a standard Markov process on the states, and we will say that $\pi$ is ergodic if the Markov process resulting from $\pi$ is ergodic (that is, has a well-defined stationary distribution). For the development and exposition, it will be easiest to consider MDP’s for which every policy is ergodic, the so-called unichain MDP’s (Puterman, 1994). Considering the unichain case simply allows us to discuss the stationary distribution of any policy without cumbersome technical details, and as it turns out, the result for unichains already forces the main technical ideas upon us. Also, note that the unichain assumption does not imply that every policy will eventually visit every state, or even that there exists a single policy that will do so quickly; thus, the exploration-exploitation dilemma remains with us strongly. We discuss the extension to the multichain case in the longer version of this paper.

If M is an MDP, then a T-path in M is a sequence p of $T + 1$ states (that is, T transitions) of M: $p = i_1, i_2, \ldots, i_T, i_{T+1}$. The probability that p is traversed in M upon starting in state $i_1$ and executing policy $\pi$ is $\Pr^\pi_M[p] = \prod_{k=1}^{T} P^{\pi(i_k)}_M(i_k i_{k+1})$. The (expected) undiscounted return along p in M is $U_M(p) = (1/T)(R_{i_1} + \cdots + R_{i_T})$, and the T-step undiscounted return from state i is $U^\pi_M(i, T) = \sum_p \Pr^\pi_M[p] \, U_M(p)$, where the sum is over all T-paths p in M that start at i. We define $U^\pi_M(i) = \lim_{T \to \infty} U^\pi_M(i, T)$. Since we are in the unichain case, $U^\pi_M(i)$ is independent of i, and we will simply write $U^\pi_M$. Furthermore, we define the optimal T-step undiscounted return from i in M by $U^*_M(i, T) = \max_\pi \{U^\pi_M(i, T)\}$. Also, $U^*_M(i) = \lim_{T \to \infty} U^*_M(i, T)$. Finally, we observe that the maximum possible T-step return is $R_{\max}$.
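The T-step undiscounted return of a fixed policy does not require enumerating T-paths: by linearity of expectation it is the average, over the first T visited states, of the expected payoff at each step. The following sketch computes it that way. The tabular layout, zero-based state numbering, and the name `t_step_return` are assumptions of this illustration, not part of the algorithms presented in the paper.

```python
def t_step_return(transitions, mean_payoffs, policy, start, T):
    """Expected T-step undiscounted return of `policy` from state `start`.

    transitions[a][i][j] is the probability of moving from i to j under
    action a; policy[i] is the index of the action taken in state i.
    """
    n = len(mean_payoffs)
    # Transition matrix of the Markov process induced by the policy.
    chain = [transitions[policy[i]][i] for i in range(n)]
    dist = [0.0] * n
    dist[start] = 1.0  # distribution over the state visited at step 1
    total = 0.0
    for _ in range(T):
        total += sum(dist[i] * mean_payoffs[i] for i in range(n))
        dist = [sum(dist[i] * chain[i][j] for i in range(n)) for j in range(n)]
    return total / T


# Hypothetical MDP: action 0 stays in place, action 1 swaps the two states.
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
mean_payoffs = [0.0, 1.0]
policy = [1, 0]  # leave state 0, then stay in state 1
value = t_step_return(transitions, mean_payoffs, policy, start=0, T=4)
```

In this example the visited states are 0, 1, 1, 1, so the 4-step return is 3/4, while the asymptotic return of the same policy is 1. The gap between the two is what the mixing time of a policy measures.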

3  Mixing  Times  for  Policies 

It is easy to see that if we are seeking results about the undiscounted return of a learning algorithm after a finite number of steps, we need to take into account some notion of the mixing times of policies in the MDP. To put it simply, for finite-time results, there may no longer be an unambiguous notion of “the” optimal policy. There may be some policies which will eventually yield high return (for instance, by finally reaching some remote, high-payoff state), but take many steps to approach this high return, and other policies which yield lower asymptotic return but higher short-term return. Such policies are simply incomparable, and the best we could hope for is an algorithm that “competes” favorably with any policy, in an amount of time that is comparable to the mixing time of that policy.

Figure 1: A simple Markov process demonstrating that finite-time convergence results must account for mixing times.

Definition 2 Let M be an MDP, and let $\pi$ be an ergodic policy in M. Then the $\epsilon$-return mixing time of $\pi$ is the smallest T such that for all $T' \geq T$, $|U^\pi_M(i, T') - U^\pi_M| \leq \epsilon$ for all i.¹

Suppose we are simply told that there is a policy $\pi$ whose asymptotic return $U^\pi_M$ exceeds R in an unknown MDP M, and that the $\epsilon$-return mixing time of $\pi$ is T. In principle, a sufficiently clever learning algorithm (for instance, one that managed to discover $\pi$ “quickly”) could achieve return close to $U^\pi_M - \epsilon$ in not much more than T steps. Conversely, without further assumptions on M or $\pi$, it is not reasonable to expect any learning algorithm to approach return $U^\pi_M$ in many fewer than T steps. This is simply because it may take the assumed policy $\pi$ itself on the order of T steps to approach its asymptotic return. For example, suppose that M has just two states and only one action (see Figure 1): state 0 with payoff 0, self-loop probability $1 - \Delta$, and probability $\Delta$ of going to state 1; and absorbing state 1 with payoff $R \gg 0$. Then for small $\epsilon$ and $\Delta$, the $\epsilon$-return mixing time is on the order of $1/\Delta$; but starting from state 0, it really will require on the order of $1/\Delta$ steps to reach the absorbing state 1 and start approaching the asymptotic return R. (A more formal lower bound along the lines of this argument will be given in the long version.)
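The two-state example can be explored numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the search for the mixing time relies on the fact that, in this particular chain, the T-step return from state 0 never decreases with T while the return from the absorbing state always equals the payoff.

```python
def two_state_return(delta, payoff, T):
    """T-step undiscounted return from state 0 of the two-state chain."""
    prob_in_state_1 = 0.0  # the chain starts in state 0
    total = 0.0
    for _ in range(T):
        total += prob_in_state_1 * payoff
        # With probability delta, a chain still in state 0 moves to state 1.
        prob_in_state_1 += (1.0 - prob_in_state_1) * delta
    return total / T


def return_mixing_time(delta, payoff, epsilon, max_steps=1_000_000):
    """Smallest T whose T-step return from state 0 is within epsilon of payoff."""
    prob_in_state_1 = 0.0
    total = 0.0
    for T in range(1, max_steps + 1):
        total += prob_in_state_1 * payoff
        prob_in_state_1 += (1.0 - prob_in_state_1) * delta
        if payoff - total / T <= epsilon:
            return T
    raise ValueError("return did not come within epsilon in max_steps steps")


slow = return_mixing_time(delta=0.01, payoff=1.0, epsilon=0.1)
fast = return_mixing_time(delta=0.1, payoff=1.0, epsilon=0.1)
```

Shrinking the escape probability lengthens the time needed before the average return gets close to its asymptotic value, so no learning algorithm started in state 0 can do well sooner than that.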


¹In the long version, we relate the notion of $\epsilon$-return mixing time to the standard notion of mixing time to stationary distributions (Puterman, 1994). The important point here is that the $\epsilon$-return mixing time is polynomially bounded by the standard mixing time, but may in some cases be substantially smaller.


Thus,  we  would  like  a  learning  algorithm  such  that  for 
any  T,  in  a  number  of  actions  that  is  polynomial  in 
T,  the  return  of  the  learning  algorithm  is  close  to  that 
achieved  by  the  best  policy  among  those  that  mix  in 
time  T.  This  motivates  the  following  definition. 

Definition 3 Let M be a Markov decision process. We define $\Pi_M^{T,\epsilon}$ to be the class of all ergodic policies $\pi$ in M whose $\epsilon$-return mixing time is at most T. We let opt($\Pi_M^{T,\epsilon}$) denote the optimal expected asymptotic undiscounted return among all policies in $\Pi_M^{T,\epsilon}$.

Our goal in the undiscounted case will be to compete with the policies in $\Pi_M^{T,\epsilon}$ in time that is polynomial in T, $1/\epsilon$ and N. We will eventually give an algorithm that meets this goal for every T and $\epsilon$ simultaneously. An interesting special case is when $T = T^*$, where $T^*$ is the $\epsilon$-return mixing time of the asymptotically optimal policy, whose asymptotic return is $U^*$. Then in time polynomial in $T^*$, $1/\epsilon$ and N, our algorithm will achieve return exceeding $U^* - \epsilon$ with high probability. It should be clear that, modulo the degree of the polynomial running time, such a result is the best that one could hope for in general MDPs. We briefly note that in the case of discounted reward, we can still hope to compete with the asymptotically optimal policy in time polynomial in the horizon time; this is discussed and achieved in the long version.

4  Main  Theorem 

We are now ready to describe our learning algorithm, and to state and prove our main theorem: namely, that the new algorithm will, for a general MDP, achieve near-optimal undiscounted performance in polynomial time. For ease of exposition only, we will first state the theorem under the assumption that the learning algorithm is given as input a “targeted” mixing time T, and the value opt($\Pi_M^{T,\epsilon}$) of the optimal return achieved by any policy mixing within T steps. These assumptions are entirely removed in Section 4.6.

Theorem 1 (Main Theorem) Let M be a Markov decision process over N states. Recall that $\Pi_M^{T,\epsilon}$ is the class of all ergodic policies whose $\epsilon$-return mixing time is bounded by T, and that opt($\Pi_M^{T,\epsilon}$) is the optimal asymptotic expected undiscounted return achievable in $\Pi_M^{T,\epsilon}$. There exists an algorithm A, taking inputs $\epsilon$, $\delta$, N, T and opt($\Pi_M^{T,\epsilon}$), such that if the total number of actions and computation time taken by A exceeds a polynomial in $1/\epsilon$, $1/\delta$, N, T, and $R_{\max}$, then with probability at least $1 - \delta$, the total undiscounted return of A will exceed opt($\Pi_M^{T,\epsilon}$) $- \epsilon$.

In the long version, we give a similar theorem for the discounted case (via a similar algorithm), with the horizon time playing the role of T. The criterion for success needs to be altered, however, since in the discounted case it is not possible to insist that the actual return achieved by the learning algorithm approach the optimal. This is due to the exponentially damped contribution of successive payoffs. Intuitively, in the discounted case it is not possible for a learning algorithm to recover from its “youthful mistakes” as it can in the undiscounted case, so we must settle for an algorithm that simply finds a near-optimal policy from its current state after a short learning period.

The remainder of this section is divided into several subsections, each describing a different and central aspect of the algorithm and proof. The full proof of the theorem is rather technical, but the underlying ideas are quite intuitive, and we sketch them first as an outline.

4.1  Overview  of  the  Proof  and  Algorithm 

Our algorithm will be what is commonly referred to as indirect or model-based: namely, rather than only maintaining a current policy or value function, the algorithm will actually maintain a model for the transition probabilities and the expected payoffs for some subset of the states of the unknown MDP M. It is important to emphasize that although the algorithm maintains a partial model of M, it may choose to never build a complete model of M, if doing so is not necessary to achieve high return.

It is easiest to imagine the algorithm as starting off by doing what we will call balanced wandering. By this we mean that the algorithm, upon arriving in a state it has never visited before, takes an arbitrary action from that state; but upon reaching a state it has visited before, it takes the action it has tried the fewest times from that state (breaking ties between actions randomly). At each state it visits, the algorithm maintains the obvious statistics: the average payoff received at that state so far, and for each action, the empirical distribution of next states reached (that is, the estimated transition probabilities).
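The bookkeeping behind balanced wandering can be sketched as follows. This is a minimal illustration under assumptions made here: the class name `BalancedWanderer`, its method names, and the dictionary-based storage are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


class BalancedWanderer:
    """Tracks per-state statistics and picks the least-tried action in each state."""

    def __init__(self, num_actions, rng=None):
        self.num_actions = num_actions
        self.rng = rng or random.Random(0)
        # action_counts[state][a]: how often action a was tried from state
        self.action_counts = defaultdict(lambda: [0] * num_actions)
        # next_counts[(state, a)][next_state]: observed transition counts
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.payoff_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def choose_action(self, state):
        # In a new state every count is 0, so the choice is arbitrary;
        # otherwise pick among the least-tried actions, breaking ties randomly.
        counts = self.action_counts[state]
        fewest = min(counts)
        return self.rng.choice([a for a, c in enumerate(counts) if c == fewest])

    def record(self, state, action, payoff, next_state):
        self.visits[state] += 1
        self.payoff_sum[state] += payoff
        self.action_counts[state][action] += 1
        self.next_counts[(state, action)][next_state] += 1

    def mean_payoff(self, state):
        return self.payoff_sum[state] / self.visits[state]

    def transition_estimate(self, state, action):
        counts = self.next_counts[(state, action)]
        total = sum(counts.values())
        return {s: n / total for s, n in counts.items()}
```

Because the least-tried action is always chosen, the action counts at any state never differ by more than one, which is what lets a bound on visits translate into a bound on tries per action.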

A crucial notion for both the algorithm and the analysis is that of a known state. Intuitively, this is a state that the algorithm has visited “so many” times (and therefore, due to the balanced wandering, has tried each action from that state many times) that the transition probability and expected payoff estimates for that state are “very close” to their true values in M. An important aspect of this definition is that it is weak enough that “so many” times is still polynomially bounded, yet strong enough to meet the simulation requirements we will outline shortly.

States are thus divided into three categories: known states, states that have been visited before but are still unknown (due to an insufficient number of visits and therefore unreliable statistics), and states that have not even been visited once. An important observation is that we cannot do balanced wandering indefinitely before at least one state becomes known: by the Pigeonhole Principle, we will soon start to accumulate accurate statistics at some state.

Perhaps our most important definition is that of the known-state MDP. If S is the set of currently known states, the current known-state MDP is simply an MDP $M_S$ that is naturally induced on S by the full MDP M; briefly, all transitions in M between states in S are preserved in $M_S$, while all other transitions in M are “redirected” in $M_S$ to lead to a single additional, absorbing state that intuitively represents all of the unknown and unvisited states.

Although the learning algorithm will not have direct access to $M_S$, by virtue of the definition of the known states, it will have an approximation $\hat{M}_S$. The first of two central technical lemmas that we prove (Section 4.2) shows that, under the appropriate definition of known state, $\hat{M}_S$ will have good simulation accuracy: that is, the expected T-step return of any policy in $\hat{M}_S$ is close to its expected T-step return in $M_S$. (Here T is the mixing time.) Thus, at any time, $\hat{M}_S$ is a partial model of M, for that part of M that the algorithm “knows” very well.

The second central technical lemma (Section 4.3) is perhaps the most enlightening part of the analysis, and is named the “Explore or Exploit” Lemma. It formalizes a rather appealing intuition: either the optimal (T-step) policy achieves its high return by staying, with high probability, in the set S of currently known states (which, most importantly, the algorithm can detect and replicate by finding a high-return exploitation policy in the partial model $\hat{M}_S$), or the optimal policy has significant probability of leaving S within T steps (which again the algorithm can detect and replicate by finding an exploration policy that quickly reaches the additional absorbing state of the partial model $\hat{M}_S$).

Thus, by performing two off-line, polynomial-time computations on $\hat{M}_S$ (Section 4.4), the algorithm is guaranteed to either find a way to get near-optimal return in M quickly, or to find a way to improve the statistics at an unknown or unvisited state. Again by the Pigeonhole Principle, the latter case cannot occur too many times before a new state becomes known, and thus the algorithm is always making progress. In the worst case, the algorithm will build a model of the entire MDP M, but if that does happen, the analysis guarantees that it will happen in polynomial time.

The  following  subsections  flesh  out  the  intuitions 
sketched  above,  providing  a  detailed  sketch  of  the 
proof  of  Theorem  1;  the  full  proofs  are  provided  in 
the  long  version.  In  Section  4.6,  we  discuss  how  to 
remove  the  assumed  knowledge  of  the  optimal  return 
and  the  targeted  mixing  time. 

4.2  The  Simulation  Lemma 

In this section, we prove the first of two key technical lemmas mentioned in the sketch of Section 4.1: namely, that if one MDP $\hat{M}$ is a sufficiently accurate approximation of another MDP M, then we can actually approximate the T-step return of any policy in M quite accurately by its T-step return in $\hat{M}$. The important technical point is that the goodness of approximation required depends only polynomially on 1/T, and thus the definition of known state will require only a polynomial number of visits to the state. Eventually, we will appeal to this lemma to show that we can accurately assess the return of policies in the induced known-state MDP $M_S$ by computing their return in the algorithm’s approximation $\hat{M}_S$ (that is, we will appeal to Lemma 2 below using the settings $M = M_S$ and $\hat{M} = \hat{M}_S$).

We begin with the definition of approximation we require.

Definition 4 Let M and $\hat{M}$ be Markov decision processes over the same state space. Then we say that $\hat{M}$ is an $\alpha$-approximation of M if for any state i, $R_M(i) - \alpha \le R_{\hat{M}}(i) \le R_M(i) + \alpha$, and for any states i and j, and any action a, $P^a_M(ij) - \alpha \le P^a_{\hat{M}}(ij) \le P^a_M(ij) + \alpha$.

We now state and prove the Simulation Lemma, which says that provided $\hat{M}$ is sufficiently close to M in the sense just defined, the T-step return of policies in $\hat{M}$ and M will be similar.

Lemma 2 (Simulation Lemma) Let M be any Markov decision process over N states. Let $\hat{M}$ be an $O((\epsilon/(N T R_{\max}))^2)$-approximation of M. Then for any policy $\pi$ and for any state i,

$U_M^{\pi}(i,T) - \epsilon \le U_{\hat{M}}^{\pi}(i,T) \le U_M^{\pi}(i,T) + \epsilon.$  (1)

Proof: (Sketch) Let $\hat{M}$ be an $\alpha$-approximation of M, and let us fix a policy $\pi$ and a start state i. Let us say that a transition from a state i' to a state j' under action a is $\beta$-small in M if $P^a_M(i'j') \le \beta$. It is possible to bound the difference between $U_M^{\pi}(i,T)$ and $U_{\hat{M}}^{\pi}(i,T)$ contributed by those T-paths that cross at least one $\beta$-small transition by $(\alpha + 2\beta) N T R_{\max}$ (details omitted). For the value of $\alpha$ stated in the lemma, our analysis chooses a value of $\beta$ that yields $(\alpha + 2\beta) N T R_{\max} \le \epsilon/4$.

Thus, for now we restrict our attention to the walks of length T that do not cross any $\beta$-small transition of M. It can be shown that for any T-path p that, under $\pi$, does not cross any $\beta$-small transitions of M, we have

$(1 - \Delta)^T \Pr^{\pi}_M[p] \le \Pr^{\pi}_{\hat{M}}[p] \le (1 + \Delta)^T \Pr^{\pi}_M[p]$  (2)

where $\Delta = \alpha/\beta$. The approximation error in the payoffs yields

$U_M(p) - \alpha \le U_{\hat{M}}(p) \le U_M(p) + \alpha.$  (3)

Since these inequalities hold for any fixed T-path that does not traverse any $\beta$-small transitions in M under $\pi$, they also hold when we take expectations over the distributions on such T-paths in M and $\hat{M}$ induced by $\pi$. Thus,

$(1 - \Delta)^T [U_M^{\pi}(i,T) - \alpha] - \epsilon/4 \le U_{\hat{M}}^{\pi}(i,T) \le (1 + \Delta)^T [U_M^{\pi}(i,T) + \alpha] + \epsilon/4$

where the additive $\epsilon/4$ terms account for the contributions of the T-paths that traverse $\beta$-small transitions under $\pi$, as bounded above. The desired constraint that the outermost quantities in this chain of inequalities be separated by an additive factor of at most $2\epsilon$ determines choices for $\Delta$ and $\alpha$ that yield the lemma (details omitted). □
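The flavor of the Simulation Lemma can be seen on a toy example. The sketch below is not the paper's proof: it simply evaluates the T-step average return of one fixed policy in a small MDP and in a slightly perturbed copy, and the specific numbers, the function name `t_step_return`, and the convention that the payoff of the current state is collected before each transition are assumptions chosen for illustration.

```python
def t_step_return(P, R, policy, start, T):
    """Expected average payoff over T steps under a stationary policy.

    P[a][i][j] is the probability of moving from i to j under action a,
    R[i] is the payoff of state i, and policy[i] is the action taken in i.
    """
    n = len(R)
    dist = [0.0] * n
    dist[start] = 1.0
    total = 0.0
    for _ in range(T):
        total += sum(dist[i] * R[i] for i in range(n))
        dist = [sum(dist[i] * P[policy[i]][i][j] for i in range(n)) for j in range(n)]
    return total / T


# True model: two states, one action.
P = [[[0.9, 0.1], [0.2, 0.8]]]
R = [0.0, 1.0]
# Approximate model: every transition probability and payoff is off by at most 0.01.
P_hat = [[[0.89, 0.11], [0.21, 0.79]]]
R_hat = [0.005, 0.995]

policy, T = [0, 0], 50
gap = t_step_return(P_hat, R_hat, policy, 0, T) - t_step_return(P, R, policy, 0, T)
print(abs(gap))
```

Here a perturbation of size 0.01 changes the 50-step return by a few hundredths; the lemma quantifies how small the perturbation must be, as a function of N, T and the payoff bound, for the gap to stay below a target accuracy.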

What role does T play in the Simulation Lemma? As we make T larger, $\hat{M}$ must be a better approximation of M in order to satisfy the conditions of the Simulation Lemma, but then we are guaranteed of the simulation accuracy of $\hat{M}$ for a larger number of steps. If we wish to “compete” with the policies in $\Pi_M^{T,\epsilon}$, then by appealing to the Simulation Lemma using T, we ensure that the asymptotic return in M of any policy in $\Pi_M^{T,\epsilon}$ is well approximated by its T-step return in $\hat{M}$.

Thus, the Simulation Lemma essentially determines what the definition of known state should be: one that has been visited enough times to ensure (with high probability) that the estimated transition probabilities and the estimated payoffs for the state are all within $O((\epsilon/(N T R_{\max}))^2)$ of their true values. A straightforward application of Chernoff bounds shows that the desired approximation will be achieved for those states from which every action has been executed at least

$m_{known} = O(((N T R_{\max})/\epsilon)^4 \, Var_{\max} \log(1/\delta))$  (4)

times, where $Var_{\max} = \max(1, \max_i[Var_M(i)])$ is the maximum of 1 and the maximum variance of the random payoffs over all states.

4.3  The  “Explore  or  Exploit”  Lemma 

The Simulation Lemma indicates the degree of approximation required for sufficient simulation accuracy, and led to the definition of a known state. If we let S denote the set of known states, we now specify the straightforward way in which these known states define an induced MDP. This induced MDP has an additional “new” state, which intuitively represents all of the unknown states and transitions.

Definition 5 Let M be a Markov decision process, and let S be any subset of the states of M. The induced Markov decision process on S, denoted $M_S$, has states $S \cup \{s_0\}$, and transitions and payoffs defined as follows:

• For any state $i \in S$, $R_{M_S}(i) = R_M(i)$; all payoffs in $M_S$ are deterministic (zero variance) even if the payoffs in M are stochastic.

• $R_{M_S}(s_0) = 0$.

• For any action a, $P^a_{M_S}(s_0 s_0) = 1$. Thus, $s_0$ is an absorbing state.

• For any states $i, j \in S$, and any action a, $P^a_{M_S}(ij) = P^a_M(ij)$. Thus, transitions in M between states in S are preserved in $M_S$.

• For any state $i \in S$ and any action a, $P^a_{M_S}(i s_0) = \sum_{j \notin S} P^a_M(ij)$. Thus, all transitions in M that are not between states in S are redirected to $s_0$ in $M_S$.
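The construction in Definition 5 can be sketched in a few lines. The representation below (nested dictionaries, a reserved label `"s0"` for the extra absorbing state, and the function name `induced_mdp`) is an assumption made for this illustration; the paper does not prescribe a data structure.

```python
S0 = "s0"  # label for the additional absorbing state (assumed not to clash with real states)


def induced_mdp(P, R, known):
    """Build the MDP induced on the state set `known`.

    P[a][i] is a dict {j: probability} for action a in state i; R[i] is the payoff of i.
    Returns (P_S, R_S) over the states in `known` plus the absorbing state S0.
    """
    known = set(known)
    R_S = {i: R[i] for i in known}
    R_S[S0] = 0.0
    P_S = {}
    for a, rows in P.items():
        P_S[a] = {S0: {S0: 1.0}}  # s0 is absorbing under every action
        for i in known:
            # Keep transitions that stay inside the known set ...
            row = {j: p for j, p in rows[i].items() if j in known}
            # ... and redirect all remaining probability mass to s0.
            row[S0] = sum(p for j, p in rows[i].items() if j not in known)
            P_S[a][i] = row
    return P_S, R_S


P = {"a": {0: {0: 0.5, 1: 0.3, 2: 0.2}, 1: {0: 1.0}, 2: {2: 1.0}}}
R = {0: 1.0, 1: 2.0, 2: 5.0}
P_S, R_S = induced_mdp(P, R, known={0, 1})
print(P_S["a"][0], R_S)
```

Applying the same function to empirical transition frequencies and observed payoffs, rather than the true ones, gives the approximate known-state model that the algorithm actually works with.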

Definition 5 describes an MDP directly induced on S by the true unknown MDP M, and as such preserves the true transition probabilities between states in S. Of course, our algorithm will only have approximations to these transition probabilities, leading to the following obvious approximation to $M_S$: if we simply let $\hat{M}$ denote the empirical approximation to M (that is, the states of $\hat{M}$ are simply all the states visited so far, the transition probabilities of $\hat{M}$ are the observed transition frequencies, and the rewards are the observed rewards), then $\hat{M}_S$ is the natural approximation to $M_S$. Now if we let S be the set of known states, as defined by Equation (4), then the simulation accuracy of $\hat{M}_S$ with respect to $M_S$ in the sense of Equation (1) follows immediately from the Simulation Lemma. Let us also observe that any return achievable in $M_S$ (and thus approximately achievable in $\hat{M}_S$) is also achievable in the “real world” M; that is, for any policy $\pi$ in M, any state $i \in S$, and any T, $U_{M_S}^{\pi}(i,T) \le U_M^{\pi}(i,T)$.

We are now at the heart of the analysis: we have identified a “part” of the unknown MDP M that the algorithm “knows” very well, in the form of the approximation $\hat{M}_S$ to $M_S$. The key lemma follows, in which we demonstrate the fact that $M_S$ (and thus, by the Simulation Lemma, $\hat{M}_S$) must always provide the algorithm with either a policy that will yield near-optimal return in the true MDP M, or a policy that will allow rapid exploration of an unknown state in M (or both).

Lemma 3 (Explore or Exploit Lemma) Let M be any Markov decision process, let S be any subset of the states of M, and let $M_S$ be the induced Markov decision process on S. For any $i \in S$ and any T, and any $1 > \alpha > 0$, either there exists a policy $\pi$ in $M_S$ such that $U_{M_S}^{\pi}(i,T) \ge U_M^{*}(i,T) - \alpha$, or there exists a policy $\pi$ in $M_S$ such that the probability that a walk of T steps following $\pi$ will terminate in $s_0$ exceeds $\alpha/R_{\max}$.

Proof: Let $\pi$ be a policy in M satisfying $U_M^{\pi}(i,T) = U_M^{*}(i,T)$, and suppose that $U_{M_S}^{\pi}(i,T) < U_M^{*}(i,T) - \alpha$ (otherwise, $\pi$ already witnesses the claim of the lemma). We may write

$U_M^{\pi}(i,T) = \sum_p \Pr^{\pi}_M[p] U_M(p) = \sum_q \Pr^{\pi}_M[q] U_M(q) + \sum_r \Pr^{\pi}_M[r] U_M(r)$

where the sums are over, respectively, all T-paths p in M, all T-paths q in M in which every state in q is in S, and all T-paths r in M in which at least one state is not in S. Keeping this interpretation of the variables p, q and r fixed, we may write

$\sum_q \Pr^{\pi}_M[q] U_M(q) = \sum_q \Pr^{\pi}_{M_S}[q] U_{M_S}(q) \le U_{M_S}^{\pi}(i,T).$

The equality follows from the fact that for any path q in which every state is in S, $\Pr^{\pi}_M[q] = \Pr^{\pi}_{M_S}[q]$ and $U_M(q) = U_{M_S}(q)$, and the inequality from the fact that $U_{M_S}^{\pi}(i,T)$ takes the sum over all T-paths in $M_S$, not just those that avoid the absorbing state $s_0$. Thus $\sum_q \Pr^{\pi}_M[q] U_M(q) \le U_M^{*}(i,T) - \alpha$, which implies that $\sum_r \Pr^{\pi}_M[r] U_M(r) \ge \alpha$. But $\sum_r \Pr^{\pi}_M[r] U_M(r) \le R_{\max} \sum_r \Pr^{\pi}_M[r]$, and so $\sum_r \Pr^{\pi}_M[r] \ge \alpha/R_{\max}$ as desired. □

4.4  Off-line  Optimal  Policy  Computations 

Let us take a moment to review and synthesize. The combination of the simulation accuracy of $\hat{M}_S$ and the Explore or Exploit Lemma establishes our basic line of argument:

• At any time, if S is the set of current known states, the T-step return of any policy $\pi$ in $\hat{M}_S$ (approximately) lower bounds the T-step return of (any extension of) $\pi$ in M.

• At any time, there must either be a policy in $\hat{M}_S$ whose T-step return in M is nearly optimal, or there must be a policy in $\hat{M}_S$ that will quickly reach the absorbing state, in which case this same policy, executed in M, will quickly reach a state that is not currently in the known set S.

At certain points in the execution of the algorithm, we will perform T-step value iteration (which takes $O(N^2 T)$ computation) off-line twice: once on $\hat{M}_S$, and a second time on what we will denote $\hat{M}'_S$. The MDP $\hat{M}'_S$ has the same transition probabilities as $\hat{M}_S$, but different payoffs: in $\hat{M}'_S$, the absorbing state $s_0$ has payoff $R_{\max}$ and all other states have payoff 0. Thus we reward exploration (as represented by visits to $s_0$) rather than exploitation. If $\hat{\pi}$ is the policy returned by value iteration on $\hat{M}_S$ and $\hat{\pi}'$ is the policy returned by value iteration on $\hat{M}'_S$, then Lemma 3 guarantees that either the T-step return of $\hat{\pi}$ from our current known state approaches the optimal achievable in M (which for now we are assuming we know, and can thus detect), or $\hat{\pi}'$ reaches $s_0$, and thus the execution of $\hat{\pi}'$ in M reaches an unknown or unvisited state, within T steps with significant probability (which we can also detect). Finally, note that even though T-step value iteration produces a non-stationary policy, it is the expected payoff that is important, not whether we follow a stationary or non-stationary policy.
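The two off-line computations can be sketched with a plain finite-horizon value iteration run on two payoff vectors. Everything concrete below is an assumption made for illustration: the function name `t_step_value_iteration`, the list-based model layout, the toy three-state model (two known states plus the absorbing state at index 2), and the convention that the payoff of the current state is collected at each of the T steps.

```python
def t_step_value_iteration(P, R, T):
    """Finite-horizon value iteration.

    P[a][i][j]: transition probability, R[i]: payoff of state i.
    Returns (avg_values, policy), where avg_values[i] is the best expected
    average payoff over T steps from i, and policy[k][i] is the best action
    in state i when k + 1 steps remain (a non-stationary policy).
    """
    n = len(R)
    values = [0.0] * n
    policy = []
    for _ in range(T):
        new_values, actions = [], []
        for i in range(n):
            best_a, best_q = max(
                ((a, sum(P[a][i][j] * values[j] for j in range(n))) for a in range(len(P))),
                key=lambda pair: pair[1],
            )
            new_values.append(R[i] + best_q)
            actions.append(best_a)
        values = new_values
        policy.append(actions)
    return [v / T for v in values], policy


# Toy known-state model: states 0 and 1 are known, index 2 plays the role of s0.
P = [
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # action 0 stays among known states
    [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],  # action 1 may leave the known set
]
r_max = 1.0
R_exploit = [1.0, 0.0, 0.0]    # estimated payoffs; s0 pays 0
R_explore = [0.0, 0.0, r_max]  # only reaching s0 is rewarded

exploit_values, exploit_policy = t_step_value_iteration(P, R_exploit, T=4)
explore_values, explore_policy = t_step_value_iteration(P, R_explore, T=4)
print(exploit_values[0], explore_values[0])
```

Since the exploration payoff is at most `r_max` per step and is nonzero only after the absorbing state has been reached, `explore_values[i] / r_max` is a lower bound on the probability that the exploration policy leaves the known set within T steps.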


4.5  Putting  it  All  Together 

All of the technical pieces we need are now in place, and we now give a more detailed description of the algorithm, and sketch the remainder of the analysis. (Again, full details are provided in the long version.) We emphasize that for now we assume the algorithm is given as input a targeted mixing time T and the optimal return opt($\Pi_M^{T,\epsilon}$) achievable in $\Pi_M^{T,\epsilon}$. In Section 4.6, we remove these assumptions.

We call the algorithm Explicit Explore or Exploit, or $E^3$, because whenever the algorithm is not engaged in balanced wandering, it performs an explicit off-line computation on the partial model in order to find a T-step policy guaranteed to either explore or exploit.

Explicit Explore or Exploit ($E^3$) Algorithm:

• (Initialization) Initially, the set S of known states is empty.

• (Balanced Wandering) Any time the current state is not in S, the algorithm performs balanced wandering.

• (Discovery of New Known States) Any time a state i has been visited $m_{known}$ times during balanced wandering, it enters the known set S, and no longer participates in balanced wandering.

• Observation: Clearly, after $N(m_{known} - 1) + 1$ steps of balanced wandering, by the Pigeonhole Principle some state becomes known. More generally, if the total number of steps of balanced wandering the algorithm has performed ever exceeds $N m_{known}$, then every state of M is known (even if these steps of balanced wandering are not consecutive).

• (Off-line Optimizations) Upon reaching a known state $i \in S$ during balanced wandering, the algorithm performs the two off-line optimal policy computations on $\hat{M}_S$ and $\hat{M}'_S$ described in Section 4.4:

  - (Attempted Exploitation) If the resulting exploitation policy $\hat{\pi}$ achieves return from i in $\hat{M}_S$ that is at least opt($\Pi_M^{T,\epsilon}$) $- \epsilon/2$, the algorithm executes $\hat{\pi}$ for the next T steps.

  - (Attempted Exploration) Otherwise, the algorithm executes the resulting exploration policy $\hat{\pi}'$ (derived from the off-line computation on $\hat{M}'_S$) for T steps in M, which by Lemma 3 is guaranteed to have probability at least $\epsilon/(2R_{\max})$ of leaving the set S.

• (Balanced Wandering) Any time an attempted exploitation or attempted exploration visits a state not in S, the algorithm immediately resumes balanced wandering.
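The pigeonhole observation in the algorithm description can be illustrated with a small counting sketch. The helper name `steps_until_first_known` and the round-robin visit order are assumptions made here; the round-robin order is simply the visit pattern that delays the first known state as long as possible.

```python
def steps_until_first_known(visit_sequence, m_known):
    """Return the step at which some state first reaches m_known visits, or None."""
    counts = {}
    for step, state in enumerate(visit_sequence, start=1):
        counts[state] = counts.get(state, 0) + 1
        if counts[state] >= m_known:
            return step
    return None


N, m_known = 5, 4
# Visiting the N states in strict rotation spreads the visits as evenly as possible.
round_robin = [t % N for t in range(N * m_known)]
first_known_step = steps_until_first_known(round_robin, m_known)
print(first_known_step)
```

Even in this worst case, some state becomes known after N(m_known - 1) + 1 steps of balanced wandering, and no visit order can delay it further.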




This concludes the description of the algorithm; we can now wrap up the analysis. One of the main remaining issues is our handling of the confidence parameter $\delta$ in the statement of the main theorem: Theorem 1 ensures that a certain performance guarantee is met with probability at least $1 - \delta$. There are essentially three different sources of failure for the algorithm:

• At some known state, the algorithm actually has a poor approximation to the next-state distribution for some action, and thus $\hat{M}_S$ does not have sufficiently strong simulation accuracy for $M_S$.

• Repeated attempted explorations fail to yield enough steps of balanced wandering to result in a new known state.

• Repeated attempted exploitations fail to result in actual return that is near opt($\Pi_M^{T,\epsilon}$).

Our handling of the failure probability $\delta$ is to simply allocate $\delta/3$ to each of these sources of failure. The fact that we can make the probability of the first source of failure (a “bad” known state) small is handled by a standard Chernoff bound analysis applied to the definition of known states.

For the second source of failure (failed attempted explorations), a standard Chernoff bound analysis again suffices: by Lemma 3, each attempted exploration can be viewed as an independent Bernoulli trial with probability at least $\epsilon/(2R_{\max})$ of “success” (at least one step of balanced wandering). In the worst case, we must make every state known before we can exploit, requiring $N m_{known}$ steps of balanced wandering. The probability of having fewer than $N m_{known}$ steps of balanced wandering will be smaller than $\delta/3$ if the number of (T-step) attempted explorations is $O((R_{\max}/\epsilon) N \log(1/\delta) m_{known})$.
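For small instances, the number of attempted explorations needed can be computed exactly from the binomial distribution instead of bounded with Chernoff. The sketch below makes several assumptions for illustration: each attempt succeeds independently with probability exactly `p` (the worst case allowed by the lower bound), the helper names are hypothetical, and the numbers are arbitrary.

```python
from math import comb


def prob_fewer_than(k, n, p):
    """P[Binomial(n, p) < k]: chance that n attempts yield fewer than k successes."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k))


def trials_needed(k, p, failure_prob):
    """Smallest n such that n attempts give at least k successes except with prob. failure_prob."""
    n = k
    while prob_fewer_than(k, n, p) > failure_prob:
        n += 1
    return n


epsilon, r_max, delta = 0.1, 1.0, 0.05
p = epsilon / (2 * r_max)  # success probability of one attempted exploration
k = 20                     # required steps of balanced wandering (illustrative)
n = trials_needed(k, p, delta / 3)
print(n)
```

The result is somewhat larger than k / p, the number of attempts needed on average, with the excess reflecting the log(1/delta) confidence term in the bound.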

Finally, we do not want to simply halt upon finding a policy whose expected return is near opt($\Pi_M^{T,\epsilon}$), but want to achieve actual return approaching opt($\Pi_M^{T,\epsilon}$), which is where the third source of failure (failed attempted exploitations) enters. We have already argued that the total number of T-step attempted explorations the algorithm can perform before S contains all states of M is polynomially bounded. All other actions of the algorithm must be accounted for by T-step attempted exploitations. Each of these T-step attempted exploitations has expected return at least opt($\Pi_M^{T,\epsilon}$) $- \epsilon/2$. The probability that the actual return, restricted to just these attempted exploitations, is less than opt($\Pi_M^{T,\epsilon}$) $- 3\epsilon/4$, can be made smaller than $\delta/3$ if the number of blocks exceeds $O((1/\epsilon)^2 \log(1/\delta))$; this is again by a standard Chernoff bound analysis. However, we also need to make sure that the return restricted to these exploitation blocks is sufficient to dominate the potentially low return of the attempted explorations. It is not difficult to show that provided the number of attempted exploitations exceeds $O(1/\epsilon)$ times the number of attempted explorations, both conditions are satisfied, for a total number of actions bounded by $O(T/\epsilon)$ times the number of attempted explorations, which is $O(N T (R_{\max}/\epsilon^2) \log(1/\delta) m_{known})$. The total computation time is thus $O(N^2 T/\epsilon)$ times the number of attempted explorations, and thus bounded by

$O(N^3 T (R_{\max}/\epsilon^2) \log(1/\delta) m_{known}).$  (5)

This concludes the proof of the main theorem. We remark that no serious attempt to minimize these worst-case bounds has been made; our immediate goal was to simply prove polynomial bounds in the most straightforward manner possible. It is likely that a practical implementation based on the algorithmic ideas given here would enjoy performance on natural problems that is considerably better than the current bounds indicate. (See Moore and Atkeson, 1993, for a related heuristic algorithm.)

4.6  Eliminating Knowledge of T and opt($\Pi_M^{T,\epsilon}$)

In order to simplify our presentation of the main theorem and the $E^3$ algorithm, we made the assumption that the learning algorithm knew the targeted mixing time T and the target optimal return opt($\Pi_M^{T,\epsilon}$) achievable in this mixing time. In this section, we briefly outline the straightforward way in which these assumptions can be removed without changing the qualitative nature of the results. Details are in the long version of this paper.

In the absence of knowledge of opt($\Pi_M^{T,\epsilon}$), the Explore or Exploit Lemma (Lemma 3) ensures us that it is safe to have a bias towards exploration. More precisely, any time we arrive in a known state i, we will first determine the exploration policy $\hat{\pi}'$ and compute the probability that $\hat{\pi}'$ will reach the absorbing state $s_0$ of $\hat{M}'_S$ in T steps. We can then compare this probability to the lower bound $\epsilon/(2R_{\max})$ of Lemma 3. As long as this lower bound is exceeded, we may execute $\hat{\pi}'$ in an attempt to visit a state not in S. If this lower bound is not exceeded, Lemma 3 guarantees that the off-line computation on $\hat{M}_S$ in the Attempted Exploitation step must result in an exploitation policy $\hat{\pi}$ that is close to optimal. We execute $\hat{\pi}$ in M and continue.

Note that this exploration-biased solution to removing knowledge of opt($\Pi_M^{T,\epsilon}$) or $V^*(i)$ results in the algorithm always exploring all states of M that can be reached in a reasonable amount of time, before doing any exploitation. This is a simple way of removing the knowledge while keeping a polynomial-time algorithm; but we explore more practical variants of this strategy in the longer paper.
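The exploration-biased rule amounts to a single threshold test. The following sketch only illustrates that test; the function name `choose_mode`, its string return values, and the argument names are hypothetical, and the escape probability is assumed to have been computed from the exploration policy on the approximate known-state model.

```python
def choose_mode(explore_prob, epsilon, r_max):
    """Decide what to do on arriving in a known state, without knowing the optimal return.

    explore_prob: probability (in the approximate model) that the exploration
    policy reaches the absorbing state within T steps.
    """
    threshold = epsilon / (2.0 * r_max)
    # Explore whenever escape from the known set is likely enough; otherwise
    # the exploitation policy is the one guaranteed to be near-optimal.
    return "explore" if explore_prob > threshold else "exploit"


print(choose_mode(0.20, epsilon=0.1, r_max=1.0))
print(choose_mode(0.01, epsilon=0.1, r_max=1.0))
```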

To remove the assumed knowledge of T, we observe that we already have an algorithm A(T) that, given T as input, runs for P(T) steps for some fixed polynomial $P(\cdot)$ and meets the desired criterion. We now propose a new algorithm A', which does not need T as input, and simply runs A sequentially for T = 1, 2, 3, .... For any T, the amount of time A' must be run before A' has executed A(T) is $\sum_{t=1}^{T} P(t) \le T P(T) = P'(T)$, which is still polynomial in T. We just need to run A' for sufficiently many steps after the first P'(T) steps to dominate any low-return periods that took place in those P'(T) steps, similar to the analysis done for the undiscounted case towards the end of Section 4.5. We again note that this solution, while sufficient for polynomial time, is not the one we would implement in practice.
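The cost of trying T = 1, 2, 3, ... in sequence can be checked directly. In the sketch below, `running_time` stands in for the polynomial P and is a made-up example; the only property used is that it is nondecreasing in T, which is an assumption of this illustration.

```python
def running_time(t):
    """Hypothetical nondecreasing polynomial bound on the steps used by A(t)."""
    return 3 * t**2 + 5


def steps_before_finishing(bound, T):
    """Total steps spent running A(1), A(2), ..., A(T) one after another."""
    return sum(bound(t) for t in range(1, T + 1))


T = 20
total = steps_before_finishing(running_time, T)
print(total, T * running_time(T))
```

Because every term of the sum is at most P(T), the total never exceeds T times P(T), so the sequential schedule costs at most one extra factor of T.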

5  Conclusion 

In this paper, we presented the $E^3$ algorithm, and showed that it achieves near-optimal undiscounted return in general MDPs in polynomial time. In the long version, we show that a slight modification of $E^3$ gives similar results for the discounted case, that the algorithms can deal with MDPs with terminating states in a natural way, and that they also work in multichain MDPs.

There are a number of interesting lines for further research. We are developing the basic ideas underlying $E^3$ into a practical algorithm, and hope to report on an implementation and experiments soon. Finding an efficient model-free version of our algorithm, and techniques for dealing with large state spaces, remain for future work.
Abstract

In  this  work,  we  present  a  new  bottom-up  al¬ 
gorithm  for  decision  tree  pruning  that  is  very 
efficient  (requiring  only  a  single  pass  through 
the  given  tree),  and  prove  a  strong  performance 
guarantee for the generalization error of the resulting
pruned tree. We work in the typical setting
in which the given tree T may have been
derived from the given training sample S, and
thus may badly overfit S. In this setting, we
give  bounds  on  the  amount  of  additional  gener¬ 
alization  error  that  our  pruning  suffers  compared 
to  the  optimal  pruning  of  T.  More  generally,  our 
results  show  that  if  there  is  a  pruning  of  T  with 
small  error,  and  whose  size  is  small  compared  to 
|S|,  then  our  algorithm  will  find  a  pruning  whose 
error  is  not  much  larger.  This  style  of  result 
has been called an index of resolvability result
by  Barron  and  Cover  in  the  context  of  density 
estimation. 

Our  algorithm  is  local  —  the  decision  to  prune 
a  subtree  is  based  entirely  on  properties  of  that 
subtree  and  the  sample  reaching  it.  To  analyze 
our  algorithm,  we  develop  tools  of  local  uniform 
convergence,  a  generalization  of  the  standard  no¬ 
tion  that  may  prove  useful  in  other  settings. 


1  Introduction 

We  consider  the  common  problem  of  finding  a  good 
pruning of a given decision tree T on the basis of sample
data S. We work in a setting in which we do not
assume the independence of T and S. In particular, we
allow  for  the  possibility  that  T  was  in  fact  constructed 
from  S,  perhaps  by  a  standard  greedy,  top-down  pro¬ 
cess  as  employed  in  the  growth  phases  of  the  C4.5  and 
CART  algorithms  [8,  3].  Our  interest  here  is  in  how 
one  should  best  use  the  data  S  a  second  time  to  find  a 
good  subtree  of  T.  Note  that  T  itself  may  badly  overfit 
the  data. 


Our  main  result  is  a  new  and  rather  efficient  pruning 
algorithm,  and  the  proof  of  a  strong  performance  guar¬ 
antee  for  this  algorithm  (Theorems  5  and  6).  Our  algo¬ 
rithm  uses  the  sample  S  to  compute  a  subtree  (prun¬ 
ing)  of  T  whose  generalization  error  can  be  related  to 
that  of  the  best  pruning  of  T.  More  generally,  the  gen¬ 
eralization  error  of  our  pruning  is  bounded  by  the  min¬ 
imum,  over  all  prunings  T',  of  the  generalization  error 
e(T')  plus  a  “complexity  penalty”  that  depends  only 
on  the  size  of  T'.  Thus,  if  there  is  a  relatively  small 
subtree  of  T  with  small  error,  our  algorithm  enjoys  a 
strong  performance  guarantee.  This  type  of  guarantee 
is  fairly  common  in  the  model  selection  literature,  and 
is  sometimes  referred  to  as  an  index  of  resolvability 
guarantee [1]. (It is also similar to the types of results
stated  in  the  literature  on  combining  “experts”  [4],  al¬ 
though  the  interest  there  is  not  in  generalization  error, 
but  in  on-line  prediction.  This  is  discussed  further  be¬ 
low.)  Our  algorithm  is  a  simple,  bottom-up  algorithm 
that performs a single pass over the tree T; hence its
running time is linear in size(T). The only information
our algorithm needs for this bottom-up pass is,
for  each  node  in  T,  the  depth  of  the  node  in  T,  the 
size  of  the  subtree  rooted  at  the  node,  and  the  number 
of  positive  and  negative  examples  reaching  the  node. 
This  information  is  typically  available  from  the  con¬ 
struction  of  the  tree,  or  can  be  computed  explicitly  in 
time O(|S| depth(T)).

An  important  aspect  of  our  algorithm  is  its  locality. 
Roughly  speaking,  this  means  that  the  decision  to 
prune  or  not  prune  a  particular  subtree  during  the  ex¬ 
ecution  is  based  entirely  on  properties  of  that  subtree 
and  the  sample  that  reaches  it.  A  number  of  common 
pruning  methods  behave  locally  in  this  sense.  The 
analysis  of  our  algorithm  requires  us  to  develop  the 
notion  of  local  uniform  convergence,  a  generalization 
of  the  standard  notion  of  uniform  convergence,  and  a 
tool  that  we  believe  may  prove  useful  in  other  settings. 
2 Related Work

There  are  a  number  of  previous  efforts  related  to  our 
results,  which  we  only  have  space  to  discuss  briefly 
here;  more  detailed  comparisons  will  be  given  in  the 
full  paper.  First  of  all,  our  pruning  algorithm  is  closely 
related  to  one  proposed  by  Mansour  [7],  who  gave  pri¬ 
marily  experimental  results,  and  did  not  give  bounds 
on  the  generalization  error  of  the  resulting  pruned  tree. 

Helmbold  and  Schapire  [4]  gave  an  efficient  algorithm 
for  predicting  nearly  as  well  as  the  best  pruning  of  a 
given  tree.  However,  this  algorithm  differs  from  ours 
in  a  number  of  important  ways.  First  of  all,  it  can¬ 
not  be  directly  applied  to  the  same  data  set  that  was 
used  to  derive  the  given  tree  in  order  to  obtain  a  good 
pruning  —  the  predictive  power  is  only  on  a  “fresh”  or 
held-out  data  set.  (A  standard  transformation  of  their 
algorithm  can  be  used  on  the  original  data  set,  but 
results  in  a  considerably  less  efficient  algorithm,  as  it 
requires  many  executions  of  the  algorithm.)  Second,  it 
does  not  actually  find  a  good  pruning  of  the  given  tree, 
but  a  weighted  combination  of  prunings.  However,  in 
the  on-line  prediction  model  of  learning,  their  result  is 
quite  strong.  Here  we  study  the  typical  batch  model 
in  which  we  may  not  assume  independence  of  our  tree 
and  data  set. 

The  use  of  dynamic  programming  for  pruning  was  al¬ 
ready  suggested  in  the  original  book  on  CART  [3]  in 
order  to  minimize  a  weighted  sum  of  the  observed  error 
and  the  size  of  the  pruning.  Bohanec  and  Bratko  [2] 
showed  that  it  is  possible  to  compute  in  quadratic  time 
the  subtree  of  a  given  tree  that  minimizes  the  training 
error  while  obeying  a  specified  size  bound.  By  com¬ 
bining  this  observation  with  the  ideas  of  structural  risk 
minimization  [10],  it  is  possible  to  derive  a  polynomial¬ 
time  algorithm  for  our  setting  with  error  guarantees 
quite  similar  to  those  we  will  give  for  our  algorithm. 
However,  this  algorithm  would  be  considerably  less  ef¬ 
ficient  than  the  one  we  shall  present. 

Finally,  our  ideas  are  certainly  influenced  by  the  many 
single-pass,  bottom-up  pruning  heuristics  in  wide  use 
in  experimental  machine  learning,  including  that  used 
by  C4.5  [8].  While  we  do  not  know  how  to  prove 
strong  error  guarantees  for  these  heuristics,  our  cur¬ 
rent  results  provide  some  justification  for  them,  and 
suggest  specific  modifications  that  yield  fast,  practi¬ 
cal  and  principled  methods  for  pruning  with  proven 
error  guarantees.  Combined  with  earlier  results  prov¬ 
ing  non-trivial  performance  guarantees  for  the  com¬ 
mon  greedy,  top-down  growth  heuristics  in  the  model 
of boosting [5], it is fair to say that there is now a solid
theoretical basis for both the top-down and bottom-up
passes  of  many  standard  decision  tree  learning  algo¬ 
rithms. 

3  Framework  and  Preliminaries 

We consider decision trees over an input domain X.
Each  such  tree  has  binary  tests  at  each  internal  node, 
where  each  test  is  chosen  from  a  class  T  of  predicates 
over  X.  We  use  trees(T,  d)  to  denote  the  class  of  all 
binary  trees  with  tests  from  T  and  at  most  d  internal 
nodes,  and  leaves  labeled  with  0  or  1. 

We will also need notation to identify paths in a
decision tree. Thus, we use PATHS(T, $\ell$) to denote
the class of all conjunctions of at most $\ell$ predicates
from T. Clearly, if v is a node in a decision tree
$T \in$ TREES(T, d), then we may associate with v a predicate
$reach_v \in$ PATHS(T, d), which is simply the conjunction
of the predicates along the path from the root
to v in T. Thus, for any input $x \in X$, $reach_v(x) = 1$ if
and only if the path defined by x in T passes through
v.

Given a node v in T, we let $T_v$ denote the subtree of T
that is rooted at v, and for any probability distribution
P over X, we let $P_v$ denote the distribution induced
by P on just those x satisfying $reach_v(x) = 1$.

In our framework, there is an unknown distribution
P over X and an unknown target function f over X.
We are given a sample S of m pairs $(x_i, f(x_i))$, where
each $x_i$ is drawn independently according to P. We
are also given a tree T = T(S) that may have been
built from the sample S. Now for f and T fixed, for
any distribution P, we define the generalization error
$\epsilon(T) = \epsilon_P(T) = \Pr_P[T(x) \ne f(x)]$, and also the training
error $\hat{\epsilon}(T) = \hat{\epsilon}_S(T) = (1/m) \sum_{x \in S} I[T(x) \ne f(x)]$,
where I is the indicator function. In this notation, for
any node v in T, we can define the local generalization
error $\epsilon_v = \epsilon_{P_v}(T_v)$ and the local training error
$\hat{\epsilon}_v = (1/|S_v|) \sum_{x \in S_v} I[T_v(x) \ne f(x)]$, where $S_v$ is the
set of all $x \in S$ satisfying $reach_v(x) = 1$. We will also
need to refer to the local errors incurred by deleting
the subtree $T_v$ and replacing it by a leaf with the majority
label of the examples reaching v. Thus, we use
$\epsilon_v(\emptyset)$ to denote $\min\{\Pr_{P_v}[f(x) = 0], \Pr_{P_v}[f(x) = 1]\}$;
this is exactly the error, with respect to $P_v$, of the optimal
constant function (leaf) 0 or 1. Similarly, we will
use $\hat{\epsilon}_v(\emptyset)$ to denote the quantity

$(1/|S_v|) \min\{ |\{x \in S_v : f(x) = 0\}|, |\{x \in S_v : f(x) = 1\}| \}$   (1)

which is the observed local error incurred by replacing
$T_v$ by the best leaf.
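
As a concrete reading of the two observed local quantities defined above, here is a minimal Python sketch. The function names are illustrative and not from the paper, and returning 0.0 when no examples reach the node is an assumption made here only to avoid division by zero.

```python
def local_training_error(predictions, labels):
    """Fraction of the examples reaching v that the subtree at v misclassifies.

    predictions[i] is the subtree's output on the i-th example reaching v,
    labels[i] is the target value f(x) of that example.
    """
    if not labels:
        return 0.0  # assumption: convention for an empty local sample
    mistakes = sum(p != y for p, y in zip(predictions, labels))
    return mistakes / len(labels)


def best_leaf_error(labels):
    """Observed local error of the best constant leaf, as in Equation (1)."""
    if not labels:
        return 0.0  # assumption: convention for an empty local sample
    ones = sum(labels)
    return min(ones, len(labels) - ones) / len(labels)
```

Both quantities depend only on the examples that reach v, which is what makes the later pruning decision local.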

As  we  have  mentioned  in  the  introduction,  we  make 
no assumptions on f, and our goal is not to “learn” f
in  the  standard  sense  of,  say,  fitting  a  decision  tree  to 
the  data  and  hoping  that  it  generalizes  well.  Here  we 
limit  our  attention  to  the  problem  of  pruning  a  given 
decision  tree.  Thus,  we  assume  that  we  are  given  as 
input  the  sample  S  and  a  particular,  fixed  tree  T,  with 
the  goal  of  finding  a  pruning  of  T  with  near-optimal 
generalization. 

It  is  important  to  specify  what  we  mean  by  a  prun¬ 
ing  of  T,  since  allowing  different  pruning  operations 
clearly  can  result  in  different  classes  of  trees  that  can 
be  obtained  from  T.  We  let  prunings(T)  denote 
the  class  of  all  subtrees  of  T,  that  is,  the  class  of 
all  trees  that  can  be  obtained  from  T  by  specifying 
nodes $v_1, \ldots, v_k$ in T and then deleting from T the
subtrees $T_{v_1}, \ldots, T_{v_k}$ rooted at those nodes. The allowed
operation is that of deleting any subtree from
the  current  tree;  prunings(T)  is  exactly  the  class  of 
trees  that  can  be  obtained  from  T  by  any  sequence 
of  such  operations.  Thus,  any  non-empty  tree  in 
prunings(T)  shares  the  same  root  as  T,  and  can  be 
“superimposed”  on  T.  In  particular,  we  are  not  al¬ 
lowing  operations  such  as  the  replacement  of  an  inter¬ 
nal  node  by  its  left  or  right  subtree  [8].  Nevertheless, 
the class PRUNINGS(T) contains an exponential number
of subtrees of T, and our goal will be to find a tree in
PRUNINGS(T) with close to the smallest generalization
error. 

Let  us  again  emphasize  that  we  do  not  assume  any  “in¬ 
dependence”  between  the  given  tree  T  and  the  sample 
S; indeed, the likely scenario is that T was built using
S. Formally, we are given a pair (S, T) in which
we allow T = T(S). We are imagining the common
scenario in which the sample S is to be used twice:
once for top-down growth of T using a heuristic such
as those used by C4.5 or CART, and now again to find
a good subtree of T. If one assumes that S is a “fresh”
or  held-out  sample  (that  is,  drawn  separately  from  the 
sample  used  to  construct  T),  the  problem  becomes  eas¬ 
ier  in  some  ways,  since  one  can  then  use  the  observed 
error on S as an approximate proxy for the generalization
error of any tree in PRUNINGS(T). There is a
trade-off  that  renders  the  two  scenarios  incomparable 
in  general  [6]:  by  using  a  hold-out  set  for  the  pruning 
phase,  we  gain  the  independence  of  the  sample  from 
the  given  tree  T,  but  at  the  price  of  having  “wasted” 
some  potentially  valuable  data  for  the  training  (con¬ 
struction)  of  T;  whereas  in  our  setting,  we  waste  no 
data,  but  cannot  exploit  independence  of  T  and  S. 


In  the  hold-out  setting,  a  good  algorithm  is  one  that 
chooses  the  tree  in  prunings(T)  that  minimizes  the 
error  on  S  (which  can  be  computed  in  polynomial 
time  via  a  dynamic  programming  approach  [2]),  and 
fairly  general  performance  guarantees  can  be  shown  [6] 
that  necessarily  weaken  as  the  hold-out  set  becomes  a 
smaller  fraction  of  the  original  data  sample. 

4  Description  of  the  Algorithm 

We  begin  with  a  detailed  description  of  the  pruning 
algorithm,  which  is  given  the  random  sample  S  and  a 
tree T = T(S) as input. The high-level structure of
the  algorithm  is  quite  straightforward:  the  algorithm 
makes  a  single  “bottom-up”  pass  through  T,  and  de¬ 
cides  for  every  node  v  whether  to  leave  the  subtree 
currently rooted at v in place (at least for the moment),
or whether to delete this subtree. More precisely,
imagine that we place a marker at each leaf of
T, and for any node v in T, let MARKERS(v) denote the
set of markers in the subtree $T_v$ rooted at v. When all
of the markers in MARKERS(v) have arrived at v, our
algorithm will then (and only then) consider whether
or not to delete the subtree then rooted at v; the algorithm
then passes all of these markers to the parent of v.
Thus, the algorithm only considers pruning at a node
v once it has first considered pruning at all nodes below
v; this simply formalizes the standard notion of
“bottom-up” processing. Also, note that the size of
the subtree rooted at v is easily computed by counting
the number of markers arriving there.

Two observations are in order here. First, the algorithm
considers a pruning operation only once at each
node v of T, at the moment when all of MARKERS(v)
resides at v. Second, the subtree rooted at v when all
of MARKERS(v) resides at v may be different from $T_v$
(the original subtree of T rooted at v), because parts
of $T_v$ may have been deleted as markers were being
passed up towards v. We thus introduce the notation
$T_v^*$ to denote the subtree that is rooted at v when all of
MARKERS(v) resides at v. It is $T_v^*$ that our algorithm
must decide whether to prune, and $T_v^*$ is defined by
the operation of the algorithm itself. We will use $T^*$
to denote the final pruning of T output by our algorithm.

It remains only to describe how our algorithm decides
whether or not to prune $T_v^*$. For this we need some
additional notation. We define $m_v = |S_v|$, and we let
$s_v$ denote the number of nodes in $T_v^*$, and $\ell_v$ be the
depth of the node v in T. Recall that $\hat{\epsilon}_v(T_v^*)$ is the
fraction of errors $T_v^*$ makes on the local sample $S_v$,
and $\hat{\epsilon}_v(\emptyset)$ is the fraction of errors the best leaf makes
on $S_v$. Then our algorithm will replace $T_v^*$ by this best
leaf if and only if

$\hat{\epsilon}_v(T_v^*) + \alpha(m_v, s_v, \ell_v, \delta) > \hat{\epsilon}_v(\emptyset)$   (2)

where $\delta \in [0, 1]$ is a confidence parameter. The exact
choice of $\alpha(m_v, s_v, \ell_v, \delta)$ will depend on the setting,
but in all cases can be thought of as a penalty for the
complexity of the subtree $T_v^*$. Let us first consider
the case in which the class T of testing functions is finite,
in which case the class of possible path predicates
PATHS(T, $\ell_v$) leading to v and the class of possible subtrees
TREES(T, $s_v$) rooted at v are also finite. In this
case, we would choose

$\alpha(m_v, s_v, \ell_v, \delta) = c \sqrt{\frac{\log(A) + \log(B) + \log(m/\delta)}{m_v}}$   (3)

for some specific constant c > 1 determined by the
analysis, where

$A = |\mathrm{PATHS}(T, \ell_v)|$   (4)

and

$B = |\mathrm{TREES}(T, s_v)|$.   (5)

Perhaps the most natural and common special case of
this finite-cardinality setting is that in which the input
space X is the boolean hypercube $\{0, 1\}^n$, and the
test class T contains just the n single-variable tests $x_i$.
These are the kinds of tests allowed in the vanilla C4.5
and CART packages, and since $|\mathrm{PATHS}(T, \ell)| \le n^{\ell}$ and
$|\mathrm{TREES}(T, s)| \le (an)^s$ for some constant a, Equation
(3) specializes to

$\alpha(m_v, s_v, \ell_v, \delta) = c' \sqrt{\frac{(\ell_v + s_v) \log(n) + \log(m/\delta)}{m_v}}$   (6)

for some specific constant c' > 1 determined by the
analysis. To simplify the exposition and to make it
more concrete, we will work with this particular choice
of T in most of our proofs, but specifically point out
how the analysis changes for the case of infinite T,
where the pruning rule is given by choosing

$\alpha(m_v, s_v, \ell_v, \delta) = c'' \sqrt{\frac{(d_{\ell_v} + d_{s_v}) \log(2m) + \log(m/\delta)}{m_v}}$   (7)

for some specific constant c'' > 1 determined by the
analysis, where $d_{\ell_v}$ and $d_{s_v}$ are the VC dimensions
of the classes PATHS(T, $\ell_v$) and TREES(T, $s_v$), respectively.
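
For concreteness, the boolean-hypercube penalty of Equation (6) can be written as a small Python function. This is a sketch under stated assumptions: the paper leaves the constant c' to the analysis, so `c` must be supplied by the caller; natural logarithms are assumed; the cap at 1 and the value returned when no examples reach the node are conventions chosen here, not part of Equation (6).

```python
import math


def penalty_boolean(m_v, s_v, l_v, delta, n, m, c):
    """Complexity penalty of Equation (6) for single-variable tests on {0,1}^n.

    m_v: number of examples reaching v      s_v: size of the current subtree at v
    l_v: depth of v                          delta: confidence parameter
    n:   number of boolean variables         m: total sample size
    c:   stand-in for the unspecified constant c'
    """
    if m_v == 0:
        return 1.0  # assumption: maximal penalty when no examples reach v
    value = c * math.sqrt(((l_v + s_v) * math.log(n) + math.log(m / delta)) / m_v)
    return min(value, 1.0)  # an observed error never exceeds 1, so cap the penalty
```

The penalty grows with the depth and subtree size and shrinks as more examples reach the node, so deep, large subtrees supported by few examples are penalized most.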


Let us first provide some brief intuition behind our algorithm,
which will serve as motivation for the ensuing
analysis as well. At each node v, our algorithm considers
whether to leave the current subtree $T_v^*$ or to
delete it. The basis for this comparison must clearly
make use of the sample S provided. Beyond this observation,
a number of ways of comparing $T_v^*$ to the best
leaf are possible. For instance, we could simply prefer
whichever of $T_v^*$ and the best leaf makes the smaller
number of mistakes on $S_v$. This is clearly a poor idea,
since $T_v^*$ cannot do worse than the best leaf (assuming
majority labels on the leaves of $T_v^*$), and may do
considerably better, yet generalize poorly compared
to the best leaf due to overfitting. Thus, it seems we
should penalize $T_v^*$ for its complexity, which is exactly
the role of the additive term $\alpha(m_v, s_v, \ell_v, \delta)$ above.
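
The bottom-up pass and the penalized comparison can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes single-variable tests on boolean vectors with the penalty of Equation (6), takes the unspecified constant as a parameter `c`, treats the root as depth 0, prunes any subtree that no example reaches, and modifies the input tree in place. The names `Node` and `prune` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    var: Optional[int] = None       # index of the tested variable; None for a leaf
    left: Optional["Node"] = None   # branch taken when x[var] == 0
    right: Optional["Node"] = None  # branch taken when x[var] == 1
    label: int = 0                  # prediction when this node is a leaf


def prune(node, sample, depth, n, m, delta, c):
    """Prune bottom-up; sample holds the (x, y) pairs reaching this node.

    Returns (subtree, mistakes of that subtree on sample, number of nodes).
    """
    if node.var is None:
        return node, sum(y != node.label for _, y in sample), 1

    # Children are processed first, so the decision at this node sees the
    # already-pruned subtree below it.
    left_s = [(x, y) for x, y in sample if x[node.var] == 0]
    right_s = [(x, y) for x, y in sample if x[node.var] == 1]
    node.left, e_l, s_l = prune(node.left, left_s, depth + 1, n, m, delta, c)
    node.right, e_r, s_r = prune(node.right, right_s, depth + 1, n, m, delta, c)

    m_v, s_v, mistakes = len(sample), 1 + s_l + s_r, e_l + e_r
    ones = sum(y for _, y in sample)
    leaf = Node(label=1 if 2 * ones > m_v else 0)  # majority label, ties go to 0
    leaf_mistakes = min(ones, m_v - ones)
    if m_v == 0:
        return leaf, 0, 1  # assumption: prune subtrees that no example reaches

    alpha = c * math.sqrt(((depth + s_v) * math.log(n) + math.log(m / delta)) / m_v)
    # Pruning rule (2): replace the subtree by the best leaf when its observed
    # local error plus the complexity penalty exceeds the leaf's observed error.
    if mistakes / m_v + alpha > leaf_mistakes / m_v:
        return leaf, leaf_mistakes, 1
    return node, mistakes, s_v
```

Each node is visited once, and the error counts and sizes are accumulated from the children rather than recomputed, in the spirit of the single-pass description; the sample splitting at each level accounts for the additional cost of collecting the per-node counts.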

One important and natural aspect of our algorithm
(and many commonly used pruning methods) is the
fact that the comparison between $T_v^*$ and the best leaf
is being made entirely on the basis of the local reduction
to the observed error. That is, the comparison
depends on $S_v$ and $T_v^*$ only, and not on all of S and T.
A reasonable alternative “global” comparison might
compare the observed error of the current entire tree,
$\hat{\epsilon}(T^*)$, plus a penalty term that depends on $size(T^*)$,
with the observed error of the entire tree but with $T_v^*$
pruned, $\hat{\epsilon}(T^* - T_v^*)$ (where $T^* - T_v^*$ is the tree after
we prune at v), plus a penalty term that depends on
$size(T^* - T_v^*)$. The important difference between this
global algorithm and ours is that in the global algorithm,
even when there is a large absolute difference in
complexity between $T_v^*$ and a leaf, this difference may
be swamped by the fact that both are embedded in the
much larger supertree $T^*$; that is, the difference is
small relative to the complexity of $T^*$. This may cause
a suboptimal insensitivity, leading to a propensity to
leave large subtrees unpruned. Indeed, it is possible to
construct examples in which the global approach leads
to prunings strictly worse than those produced by our
algorithm, demonstrating that results as strong as
we will give are not possible for the global method.

Our analysis proceeds as follows. We first need to argue
that any time our algorithm chooses not to prune
$T_v^*$, then (with high probability) this was in fact the
“right” decision, in the sense that the current tree $T^*$
would be degraded by deleting $T_v^*$. This allows us to
establish that our final pruning will be a subtree of
the optimal pruning, so our only source of additional
error results from those subtrees of this optimal pruning
that we deleted. A careful amortized analysis allows
us to bound this additional error by a quantity
related to the size of the optimal pruning. This line
of argument establishes a relationship between the error
of our pruning and that of the optimal pruning;
a slight modification of the algorithm and a more involved
analysis let us make a similar comparison to
any pruning. This extension is important for cases
in which there may be a pruning whose error is only
slightly worse than that of the optimal pruning, but
whose size is much smaller. In such a case our bounds
are much better.


5  Local  Uniform  Convergence 

In standard uniform convergence results, we have a
class of events (predicates), and we prove that the observed
frequency of any event in the class does not
differ much from its true probability. We would like
to apply such results to events of the form “subtree
$T_v^*$ makes an error on x”, but do not wish to take
what is perhaps the most obvious approach towards doing
so. The reason is that we want to examine this event
conditioned on the event that x reaches v, and obviously
this conditioning event differs for every v. One
approach would be to redefine the class of events of
interest to include the conditioning events, that is, to
look at events of the form “x satisfies $reach_v(x)$ and
$T_v^*$ makes an error on x” for all possible $reach_v(x)$ and
$T_v^*$. It turns out that this approach would result in
final bounds on our performance that are significantly
worse than what we will obtain. What we really want
is the rather natural notion of local uniform convergence:
for any conditioning event c in a class C, and
any event e in a class E, we would like to relate the
observed frequency of e restricted to the subsample satisfying
c to the true probability of e on the distribution
conditioned on c; and clearly the accuracy of this observed
frequency will depend not on the overall sample
size, but on the number of examples satisfying the
conditioning event c. Such a relationship is given by
the next two lemmas, which treat the cases of finite
classes and infinite classes separately.

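Lemma 1 Let C and H be finite classes of boolean
functions over X, let f be a target boolean function
over X, and let P be a probability distribution over X.
For any $c \in C$ and $h \in H$, let $\epsilon_c(h) = \Pr_P[h(x) \ne f(x) \mid c(x) = 1]$,
and for any labeled sample S of f(x),
let $\hat{\epsilon}_c(h)$ denote the fraction of points in $S_c$ on which
h errs, where $S_c = \{x \in S : c(x) = 1\}$. Then the
probability that there exists a $c \in C$ and an $h \in H$
such that

$|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \sqrt{\frac{\log(|C|) + \log(|H|) + \log(1/\delta)}{m_c}}$   (8)

is at most $\delta$, where $m_c = |S_c|$.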

Proof: Let us fix $c \in C$ and $h \in H$. For these fixed
choices, we have for any value $\Delta$

$\Pr_P[|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \Delta] = E_{m_c}[\Pr_{S_c}[|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \Delta]]$.   (9)

Here the expectation is over the distribution on values
of $m_c$ induced by P, and the distribution on $S_c$
is over samples of size $m_c$ (which is fixed inside the
expectation) drawn according to $P_c$ (the distribution
P conditioned on c being 1). Since $m_c$ is fixed, by
standard Chernoff bounds we have

$\Pr_{S_c}[|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \Delta] \le e^{-\Delta^2 m_c}$   (10)

giving the bound

$\Pr_P[|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \Delta] \le E_{m_c}[e^{-\Delta^2 m_c}]$.   (11)

If we choose

$\Delta = \sqrt{\frac{\log(|C|) + \log(|H|) + \log(1/\delta)}{m_c}}$   (12)

then $e^{-\Delta^2 m_c} = \delta/(|C||H|)$, which is a constant independent
of $m_c$ and thus can be moved outside
the expectation. By appealing to the union bound
($\Pr[A \vee B] \le \Pr[A] + \Pr[B]$), the probability that
there is some c and h such that $|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \Delta$ is
at most $|C||H|(\delta/(|C||H|)) = \delta$, as desired. □
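
The key step of the proof, that the chosen deviation makes the per-pair failure bound independent of the local sample size, can be checked numerically. The sketch below assumes a Chernoff-style bound of the form exp(-Δ² m_c) for a fixed pair (c, h), which is the form consistent with the proof's conclusion that the bound equals δ/(|C||H|); the function names and the class sizes in the loop are illustrative only.

```python
import math


def lemma1_deviation(size_C, size_H, delta, m_c):
    """Deviation chosen in the proof of Lemma 1 for a local sample of size m_c."""
    return math.sqrt(
        (math.log(size_C) + math.log(size_H) + math.log(1 / delta)) / m_c
    )


def per_pair_failure(deviation, m_c):
    # Assumed form of the Chernoff bound for one fixed pair (c, h).
    return math.exp(-deviation**2 * m_c)


# The union bound over all |C||H| pairs gives delta, whatever the value of m_c.
for m_c in (10, 1000):
    dev = lemma1_deviation(50, 200, 0.05, m_c)
    assert math.isclose(50 * 200 * per_pair_failure(dev, m_c), 0.05)
```

Smaller local samples yield a larger deviation but the same overall failure probability, which is why the guarantee scales with the number of examples reaching a node rather than with m.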

Our use of Lemma 1 will be straightforward. Suppose
we are considering some node v in a decision tree T at
depth $\ell_v$, and with a subtree $T_v^*$ of size $s_v$ rooted at v.
Then we will appeal to the lemma choosing the conditioning
class C to be the class PATHS(T, $\ell_v$), choosing
H to be TREES(T, $s_v$), and choosing $\delta$ to be $\delta'/m^2$,
where $\delta'$ is the overall confidence we desire. In this
case, the local complexity penalty $\alpha(m_v, s_v, \ell_v, \delta)$ in
Equation (3) and the deviation $\Delta$ in Equation (12) coincide,
and thus we can assert that with probability
$1 - \delta'/m^2$ there is no leaf of depth $\ell_v$ and subtree of
size $s_v$ such that the local observed error of the subtree
deviates by more than $\alpha(m_v, s_v, \ell_v, \delta)$ from the
local true error. By summing over all $m^2$ choices for
$\ell_v$ and $s_v$, we obtain an overall bound of $\delta'$ on the
failure probability.

In other words, if we limit our attention to the local
errors $\epsilon_v$ (generalization) and $\hat{\epsilon}_v$ (observed), then
with high probability we can assert that they will be
within an amount (namely, $\alpha(m_v, s_v, \ell_v, \delta)$) that depends
only on local quantities: the local sample size
$m_v$, the length $\ell_v$ of the path leading to v, and the size
$s_v$ of the subtree rooted at v.

A  more  complicated  argument  is  needed  to  prove  local 
uniform  convergence  for  the  case  of  infinite  classes. 


Lemma 2 Let C and H be classes of boolean functions
over X, let f be a target boolean function over X,
and let P be a probability distribution over X. For any
$c \in C$ and $h \in H$, let $\epsilon_c(h) = \Pr_P[h(x) \ne f(x) \mid c(x) = 1]$,
and for any labeled sample S of f(x), let $\hat{\epsilon}_c(h)$ denote
the fraction of points in $S_c$ on which h errs, where
$S_c = \{x \in S : c(x) = 1\}$. Then the probability that
there exists a $c \in C$ and an $h \in H$ such that

$|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \sqrt{\frac{(d_C + d_H) \log(2m) + \log(1/\delta)}{m_c}}$   (13)

is at most $\delta$, where $m_c = |S_c|$, and $d_C$ and $d_H$ are the
VC dimensions of C and H, respectively.


Proof: (Sketch) The proof closely follows the “two-sample
trick” proof for the classical VC theorem [9],
with an important variation. Intuitively, we introduce
a “nested two-sample trick”, since we need to apply
the idea twice: once for C, and again for H.

As in the classical proof, we define two events, but
now they are “local” events. Event A(S) is that in a
random sample S of m examples, there exists a $c \in C$
and an $h \in H$ such that $|\epsilon_c(h) - \hat{\epsilon}_c(h)| > \Delta$. Event
B(S, S') is that in a random sample $S \cup S'$ of 2m examples,
there exists a $c \in C$ and an $h \in H$ such that
$|\hat{\epsilon}_c(h) - \hat{\epsilon}'_c(h)| > \Delta/2$, where $\hat{\epsilon}_c(h)$ and $\hat{\epsilon}'_c(h)$ denote the
observed local error of h on S and S', respectively.

We use the fact that

$\Pr_S[A(S)] = \Pr_{S,S'}[A(S)]$   (14)

$= \Pr_{S,S'}[A(S) \wedge B(S, S')] / \Pr_{S,S'}[B(S, S') \mid A(S)]$.   (15)

Clearly, $\Pr_{S,S'}[A(S) \wedge B(S, S')] \le \Pr_{S,S'}[B(S, S')]$.
We also have the inequality $\Pr_{S,S'}[B(S, S') \mid A(S)] \ge 1/2$.
Therefore, $\Pr_{S,S'}[A(S)] \le 2 \Pr_{S,S'}[B(S, S')]$, and
we can concentrate on bounding the probability of
event B.


Let us first consider a fixed set of 2m inputs $x_1, \ldots, x_{2m}$.
The number of possible subsets of this set induced by
taking intersections with sets in C is at most $\Phi_C(2m)$,
where $\Phi$ is the dichotomy counting function of classical
VC analysis. Let us fix a $c \in C$, and consider the
subset $S_c$ of $x_1, \ldots, x_{2m}$ that fall in c; let $m_c = |S_c|$.
Now consider all possible labelings of $S_c$ by the concept
class H; there are at most $\Phi_H(m_c) \le \Phi_H(2m)$ such labelings.
Let us now also fix one of these labelings, by
fixing some $h \in H$.

Now both $c \in C$ and $h \in H$ are fixed. Consider splitting
$S_c$ randomly into two subsets $S_c^1$ and $S_c^2$. For
event B to hold, we need the difference between the
observed errors of h on $S_c^1$ and $S_c^2$ to be at least $\Delta/2$.
It can be shown that this will occur with probability
at most $e^{-\Delta^2 m_c/12}$, where the probability is taken only
over the random partitioning of $S_c$. Now if we choose

$\Delta = \sqrt{\frac{(d_C + d_H) \log(2m) + \log(1/\delta)}{m_c}}$   (16)

then, up to the constant in the exponent,
$e^{-\Delta^2 m_c} = (1/(2m)^{d_C})(1/(2m)^{d_H})\delta$, which is
independent of $m_c$. We can then bound the probability
that this event occurs for some c and h by summing
this bound over all possible subsets $S_c$, and all possible
labelings of $S_c$ by functions in H, giving a bound
of $\Phi_C(2m)\Phi_H(2m)(1/(2m)^{d_C})(1/(2m)^{d_H})\delta$. Using the fact
that $\Phi_C(m) \le m^{d_C}$ and $\Phi_H(m) \le m^{d_H}$ yields an overall
bound of $\delta$, as desired. □

6  Analysis  of  the  Pruning  Algorithm 

In this section, we apply the tools of local uniform
convergence to analyze the pruning algorithm given in
Section 4. As mentioned earlier, for simplicity in exposition,
we will limit our attention to the common case
in which X is the boolean hypercube $\{0, 1\}^n$ and the
class T of allowed node tests is just the input variables
$x_i$, in which case the pruning rule used by our bottom-up
algorithm is that given by Equation (6). However,
it should be clear how the analysis easily generalizes
to the more general algorithms given by Equations (3)
and (7).

We shall first give an analysis that compares the generalization
error of the pruning $T^*$ produced by our
algorithm from S and T to the generalization error of
$T_{opt}$, the pruning of T that minimizes the generalization
error. Recall that we use $T_v^*$ to denote the subtree
that is rooted at node v of T at the time our algorithm
decides whether or not to prune at v, which may be a
subtree of $T_v$ due to prunings that have already taken
place below v.

We will show that $\epsilon(T^*)$ is larger than $\epsilon_{opt} = \epsilon(T_{opt})$
by an amount that can be bounded by a function of
the size $s_{opt}$ and depth $\ell_{opt}$ of $T_{opt}$. Thus, if there is
a reasonably small subtree of $T$ with small generalization
error, our algorithm will produce a pruning with
small generalization error as well. In Section 7, we
will improve our analysis to compare the error of $T^*$
to that of any pruning, and provide a discussion of situations
in which this result may be considerably more
powerful than our initial comparison to $T_{opt}$ alone.

For the analysis, it will be convenient to introduce the
notation

$$r_v = (\ell_v + s_v)\log(n) + \log(m/\delta) \tag{17}$$

for any node $v$, where $\ell_v$ is the depth of $v$ in $T$, and
$s_v$ is the size of $T_v^*$. In this notation, the penalty
$\alpha(m_v, s_v, \ell_v, \delta)$ given by Equation (6) is simply a constant
(that we ignore in the analysis for ease of exposition)
times $\sqrt{r_v/m_v}$. (We assume that $\sqrt{r_v/m_v} \le 1$,
since a penalty which is larger than 1 can be modified
to a penalty of 1 without changing the results.)
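
As a rough illustration of this notation, the following sketch evaluates $r_v$ from Equation (17) and the capped penalty $\sqrt{r_v/m_v}$. The function names, the use of the natural logarithm, and the placeholder constant `c` are our assumptions for illustration; the analysis itself ignores the constant.

```python
import math


def r_value(depth, size, n, m, delta):
    """r_v = (l_v + s_v) * log(n) + log(m / delta), following Equation (17).

    Natural logarithms are an assumption; the base only changes constants.
    """
    return (depth + size) * math.log(n) + math.log(m / delta)


def penalty(m_v, depth, size, n, m, delta, c=1.0):
    """Penalty c * sqrt(r_v / m_v), capped at 1.

    `c` stands in for the unspecified constant of Equation (6) (hypothetical
    value). A node reached by no examples gets the maximal penalty 1.
    """
    if m_v <= 0:
        return 1.0
    return min(1.0, c * math.sqrt(r_value(depth, size, n, m, delta) / m_v))
```

For a fixed depth and subtree size, the penalty shrinks like $1/\sqrt{m_v}$ once enough examples reach the node, which is what allows large subtrees to survive pruning only where the data supports them.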

Lemma 3 With probability at least $1 - \delta$ over the draw
of the input sample $S$, $T^*$ is a subtree of $T_{opt}$.

Proof: Consider any node $v$ that is a leaf in $T_{opt}$. It
suffices to argue that our algorithm would choose to
prune $T_v^*$, the subtree that remains at $v$ when our algorithm
reaches $v$. By Equation (6), our algorithm
would fail to prune $T_v^*$ only if $\hat\epsilon_v(\emptyset)$ exceeded $\hat\epsilon_v(T_v^*)$
by at least the amount $\alpha(m_v, s_v, \ell_v, \delta)$, in which case
Lemma 1 ensures that $\epsilon_v(T_v^*) < \epsilon_v(\emptyset)$ with high probability.
In other words, if our algorithm fails to prune
$T_v^*$, then $T_{opt}$ would have smaller generalization error
by including $T_v^*$ rather than making $v$ a leaf. This
contradicts the optimality of $T_{opt}$. □

Lemma 3 means that the only source of additional error
of $T^*$ compared to $T_{opt}$ is through overpruning, not
underpruning. Thus, for the purposes of our analysis,
we can imagine that our algorithm is actually run on
$T_{opt}$ rather than the original input tree $T$ (that is, the
algorithm is initialized starting at the leaves of $T_{opt}$,
since we know that the algorithm will prune everything
below this frontier).

Let $V = \{v_1, \ldots, v_t\}$ be the sequence of nodes in $T_{opt}$
at which the algorithm chooses to prune the subtree
$T_{v_i}^*$ rather than to leave it in place; note that $t \le s_{opt}$.
Then we may express the additional generalization error
$\epsilon(T^*) - \epsilon_{opt}$ as

$$\epsilon(T^*) - \epsilon_{opt} = \sum_{i=1}^{t} \left(\epsilon_{v_i}(\emptyset) - \epsilon_{v_i}(T_{v_i}^*)\right) p_{v_i} \tag{18}$$

where $p_{v_i}$ is the probability under the input distribution
$P$ of reaching node $v_i$, that is, the probability of
satisfying the path predicate $reach_{v_i}$. Each term in the
summation of Equation (18) simply gives the change
to the global error incurred by pruning $T_{v_i}^*$, expressed
in terms of the local errors. Clearly the additional
error of $T^*$ is the sum of all such changes.

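Now we may write

$$\epsilon(T^*) - \epsilon_{opt} \le \sum_{i=1}^{t} \left( |\epsilon_{v_i}(\emptyset) - \hat\epsilon_{v_i}(\emptyset)| + |\hat\epsilon_{v_i}(\emptyset) - \hat\epsilon_{v_i}(T_{v_i}^*)| + |\hat\epsilon_{v_i}(T_{v_i}^*) - \epsilon_{v_i}(T_{v_i}^*)| \right) p_{v_i} \tag{19}$$

$$\le \sum_{i=1}^{t} \left( \sqrt{\frac{(\ell_{v_i} + 1)\log(n) + \log(m/\delta)}{m_{v_i}}} + \alpha(m_{v_i}, s_{v_i}, \ell_{v_i}, \delta) \right. \tag{20}$$

$$\left. {} + \sqrt{\frac{(\ell_{v_i} + s_{v_i})\log(n) + \log(m/\delta)}{m_{v_i}}} \right) p_{v_i} \tag{21}$$

The first inequality comes from the triangle inequality.
The second inequality uses two invocations of
Lemma 1, and the fact that our algorithm directly
compares $\hat\epsilon_{v_i}(\emptyset)$ and $\hat\epsilon_{v_i}(T_{v_i}^*)$, and prunes only when
they differ by less than $\alpha(m_{v_i}, s_{v_i}, \ell_{v_i}, \delta)$.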

Thus, we would like to bound the sum $A = \sum_{i=1}^{t} \sqrt{r_{v_i}/m_{v_i}}\, p_{v_i}$.
The leverage we will eventually
use is the fact that $\sum_{i=1}^{t} r_{v_i}$ can be bounded by quantities
involving only the tree $T_{opt}$, since all of the $T_{v_i}^*$
are disjoint subtrees of $T_{opt}$. First it will be convenient
to break this sum into two sums: one involving just
those terms for which $p_{v_i}$ is "small", and the other involving
just those terms for which $p_{v_i}$ is "large". The
advantage is that for the large $p_{v_i}$, we can relate $p_{v_i}$ to
its empirical estimate $\hat{p}_{v_i} = m_{v_i}/m$, as evidenced by
the following lemma.


Lemma 4 The probability, over the sample $S$, that
there exists a node $v_i \in V$ such that $p_{v_i} > 12\log(t/\delta)/m$
but $p_{v_i} > 2\hat{p}_{v_i}$, is at most $\delta$.

Proof: We will use the relative Chernoff bound

$$\Pr[\hat{p}_{v_i} < (1 - \gamma) p_{v_i}] \le e^{-\gamma^2 m p_{v_i}/3} \tag{22}$$

which holds for any fixed $v_i$. By taking $\gamma = 1/2$ and
applying the union bound, we obtain

$$\Pr[\exists v_i \in V : p_{v_i} > 2\hat{p}_{v_i}] \le t\, e^{-m p_{v_i}/12}. \tag{23}$$

Now we can use the assumed lower bound on $p_{v_i}$ to
bound the probability of the event by $\delta$. □


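Let $V'$ be the subset of $V$ for which the lower bound
$p_{v_i} > 12\log(t/\delta)/m$ holds. We divide the sum that
describes $A$ into two parts:

$$A = \sum_{v_i \in V - V'} \sqrt{r_{v_i}/m_{v_i}}\, p_{v_i} + \sum_{v_i \in V'} \sqrt{r_{v_i}/m_{v_i}}\, p_{v_i} \tag{24}$$

The first sum is bounded by $12\log(s_{opt}/\delta)\, s_{opt}/m$,
since $\sqrt{r_{v_i}/m_{v_i}}$ is at most 1, and $t \le s_{opt}$.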


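For the second sum, we perform a maximization. By
Lemma 4, with high probability we have that for every
$v_i \in V'$, $p_{v_i} \le 2\hat{p}_{v_i} = 2m_{v_i}/m$. Thus, with high
probability we have

$$\sum_{v_i \in V'} \sqrt{\frac{r_{v_i}}{m_{v_i}}}\, p_{v_i} \le 2 \sum_{v_i \in V'} \sqrt{\frac{r_{v_i}}{m_{v_i}}}\, \frac{m_{v_i}}{m} \tag{25}$$

$$= \frac{2}{m} \sum_{v_i \in V'} \sqrt{r_{v_i} m_{v_i}} \tag{26}$$

$$\le \frac{2}{m} \sqrt{\Big(\sum_{v_i \in V'} r_{v_i}\Big)\Big(\sum_{v_i \in V'} m_{v_i}\Big)} \tag{27}$$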
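
The last step above appears to be an application of the Cauchy-Schwarz inequality. Under that reading (our assumption, since the source does not name the inequality), and writing $r_i$ and $m_i$ for $r_{v_i}$ and $m_{v_i}$, the bound on the second sum follows as:

```latex
\sum_{v_i \in V'} \sqrt{r_i m_i}
  = \sum_{v_i \in V'} \sqrt{r_i}\,\sqrt{m_i}
  \le \sqrt{\sum_{v_i \in V'} r_i}\;\sqrt{\sum_{v_i \in V'} m_i},
\qquad\text{so}\qquad
\sum_{v_i \in V'} \sqrt{\frac{r_i}{m_i}}\,\frac{2 m_i}{m}
  \le \frac{2}{m}\sqrt{\Big(\sum_{v_i \in V'} r_i\Big)\Big(\sum_{v_i \in V'} m_i\Big)}.
```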


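To bound this last expression, we first bound
$\sum_{v_i \in V'} r_{v_i}$. Recall that

$$r_{v_i} = (\ell_{v_i} + s_{v_i})\log(n) + \log(m/\delta). \tag{28}$$

Since for any $v_i \in V'$ we have $\ell_{v_i} \le \ell_{opt}$, we have that
$\sum_{v_i \in V'} \ell_{v_i} \le s_{opt}\ell_{opt}$, since $|V'| \le t \le s_{opt}$. Since the
subtrees $T_{v_i}^*$ that we prune are disjoint and subsets of
the optimal subtree $T_{opt}$, we have $\sum_{v_i \in V'} s_{v_i} \le s_{opt}$.
Thus

$$\sum_{v_i \in V'} r_{v_i} \le s_{opt}\left((1 + \ell_{opt})\log(n) + \log(m/\delta)\right). \tag{29}$$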


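To bound $\sum_{v_i \in V'} m_{v_i}$ in Equation (27), we observe
that since the sets of examples that reach different
nodes at the same depth in the tree are disjoint, we
have $\sum_{v_i \in V'} m_{v_i} \le m\ell_{opt}$. Thus, with probability $1 - \delta$,
we obtain an overall bound

$$A \le 12\log(s_{opt}/\delta)\,\frac{s_{opt}}{m} + \frac{2}{m}\sqrt{s_{opt}\left((1 + \ell_{opt})\log(n) + \log(m/\delta)\right)(m\ell_{opt})} \tag{30}$$

$$= O\left(\left(\log(s_{opt}/\delta) + \ell_{opt}\sqrt{\log(n) + \log(m/\delta)}\right) \times \sqrt{s_{opt}/m}\right) \tag{31}$$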


This  gives  the  first  of  our  main  results. 

Theorem 5 Let $S$ be a random sample of size $m$
drawn according to an unknown target function and input
distribution. Let $T = T(S)$ be any decision tree,
and let $T^*$ denote the subtree of $T$ output by our pruning
algorithm on inputs $S$ and $T$. Let $\epsilon_{opt}$ denote the
smallest generalization error among all subtrees of $T$,
and let $s_{opt}$ and $\ell_{opt}$ denote the size and depth of the
subtree achieving $\epsilon_{opt}$. Then with probability $1 - \delta$ over
$S$,

$$\epsilon(T^*) - \epsilon_{opt} = O\left(\left(\log(s_{opt}/\delta) + \ell_{opt}\sqrt{\log(n) + \log(m/\delta)}\right) \times \sqrt{s_{opt}/m}\right) \tag{32}$$

7  An  Index  of  Resolvability  Result 

Roughly speaking, Theorem 5 ensures that the true error
of the pruning found by our algorithm will be larger
than that of the best possible pruning by an amount
that is not much worse than $\sqrt{s_{opt}/m}$ (ignoring logarithmic
and depth factors for simplicity). How good
is this? Since we assume that $T$ itself (and therefore,
all subtrees of $T$) may have been constructed from the
sample $S$, standard model selection analyses [10] indicate
that $\epsilon_{opt}$ may be larger than the error of the best
decision tree approximation to the target function by
an amount growing like $\sqrt{s_{opt}/m}$. (Recall that $\epsilon_{opt}$
is only the error of the optimal subtree of $T$; there
may be other trees which are not subtrees of $T$ with
error less than $\epsilon_{opt}$, especially if $T$ was constructed by
a greedy top-down heuristic.) Thus, if we only compare
our error to that of $T_{opt}$, we are effectively only
paying an additional penalty of the same order that
$T_{opt}$ pays. If $s_{opt}$ is small compared to $m$ (that is,
the optimal subtree of $T$ is small), then this is quite
good indeed.

But a stronger result is possible and desirable. Suppose
that $T_{opt}$ is not particularly small, but that there
is a much smaller subtree $T'$ whose error is not much
worse than $\epsilon_{opt}$. In such a case, we would rather claim
that our error is close to that of $T'$, with a penalty
that goes only like $\sqrt{s'/m}$, where $s'$ is the size of $T'$.
This was the index of resolvability
criterion for model selection first examined
for density estimation by Barron and Cover [1], and
we now generalize our main result to this setting.

Theorem 6 Let $S$ be a random sample of size $m$
drawn according to an unknown target function and input
distribution. Let $T = T(S)$ be any decision tree, and
let $T^*$ denote the subtree of $T$ output by our pruning
algorithm on inputs $S$ and $T$. Then with probability
$1 - \delta$ over $S$,

$$\epsilon(T^*) \le \min_{T'} \left\{ \epsilon(T') + O\left(\left(\log(s_{eff}(T')/\delta) + \ell_{eff}(T')\sqrt{\log(n) + \log(m/\delta)}\right)\sqrt{s_{eff}(T')/m}\right) \right\}. \tag{33}$$

Here the min is taken over all subtrees $T'$ of $T$, and
we define the "effective" size

$$s_{eff}(T') = s' + 2m(\epsilon(T') - \epsilon_{opt}) + 6s'\log(s'/\delta) \tag{34}$$

and the "effective" depth $\ell_{eff}(T') = \min\{\ell_{opt}, s'\}$,
where $s'$ and $\ell'$ are the size and depth of $T'$, $\epsilon_{opt}$ denotes
the smallest generalization error among all subtrees
of $T$, and $\ell_{opt}$ denotes the depth of the subtree
achieving $\epsilon_{opt}$.

The proof is omitted due to space considerations, but
the main difference from the proof of Theorem 5 is that
our pruning is no longer a subtree of the pruning $T'$
to which it is being compared. This requires a slight
modification of the pruning penalty $\alpha(m_v, s_v, \ell_v, \delta)$,
and the analysis bounding the sum of the sizes of the
pruned subtrees becomes more involved.

Again ignoring logarithmic and depth factors for simplicity,
Theorem 6 compares the error of our pruning
simultaneously to all prunings $T'$. Our additional error
goes roughly like $\sqrt{s_{eff}(T')/m}$. In Equation (34),
if $s'$ is small compared to $m$ and $\epsilon(T')$ is not much
larger than $\epsilon_{opt}$, then the bound shows that our error
will compare well to $\epsilon_{opt}$, even though the tree $T'$
achieving the min may not be $T_{opt}$. This is the power
of the guarantee provided by index of resolvability results.
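
To make the trade-off in Equation (34) concrete, the sketch below computes the effective size of a candidate pruning and the resulting rough rate $\sqrt{s_{eff}(T')/m}$. It is only an illustration: the function names are hypothetical, natural logarithms are assumed, and the constants and logarithmic factors hidden in the $O(\cdot)$ are ignored.

```python
import math


def effective_size(s_prime, err_prime, err_opt, m, delta):
    """s_eff(T') = s' + 2m(eps(T') - eps_opt) + 6 s' log(s'/delta), Equation (34)."""
    return (s_prime
            + 2.0 * m * (err_prime - err_opt)
            + 6.0 * s_prime * math.log(s_prime / delta))


def rough_rate(s_prime, err_prime, err_opt, m, delta):
    """sqrt(s_eff(T') / m): the leading behaviour of the additional error,
    ignoring logarithmic and depth factors."""
    return math.sqrt(effective_size(s_prime, err_prime, err_opt, m, delta) / m)
```

A small pruning with slightly worse error than $\epsilon_{opt}$ can have a smaller effective size than a large optimal pruning, because the term $2m(\epsilon(T') - \epsilon_{opt})$ may be outweighed by the savings in $s'$.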
Abstract

We present an analysis of actor/critic algorithms
in which the actor updates its policy
using eligibility traces of the policy parameters.
Most of the theoretical results for eligibility
traces have concerned only the critic's
value-iteration algorithms. This paper investigates
what the actor's eligibility trace does. The
results show that the algorithm is an extension
of Williams' REINFORCE algorithms
to infinite-horizon reinforcement tasks, and
that the critic provides an appropriate reinforcement
baseline for the actor. Thanks
to the actor's eligibility trace, the actor improves
its policy by using a gradient of the actual
return, not a gradient of the
estimated return in the critic. This enables the
agent to learn a fairly good policy even when
the approximated value function
in the critic is hopelessly inaccurate for
conventional actor/critic algorithms. Also, if
an accurate value function is estimated by the
critic, the actor's learning is dramatically accelerated
in our test cases. The behavior of
the algorithm is demonstrated through simulations
of a linear quadratic control problem
and a pole balancing problem.


1  Introduction 

Actor/critic architecture is an adaptive version of policy
iteration [Kaelbling et al. 96]. In general, policy
iteration alternates two phases: a policy evaluation
phase and a policy improvement phase. The actor implements
a stochastic policy that maps from a representation
of a state to a probability distribution over
actions. The critic attempts to estimate the evaluation
function for the current policy. The actor improves its
control policy using the critic's temporal difference (TD)
as an effective reinforcement. In many cases, the policy
improvement is executed concurrently with the policy
evaluation, because it is not feasible to wait for the
policy evaluation to converge.

gineering, Tokyo Institute of Technology, 4259 Nagatsuta
Midori-ku Yokohama 226-8502 JAPAN.

The actor/critic algorithms have been successfully
applied to a variety of delayed reinforcement
tasks: the ASE/ACE architecture for pole balancing
[Barto et al. 83] [Gullapalli 92], RFALCON for pole
balancing and for control of a ball-beam system
[Lin et al. 96], and a cart-pole swing-up task [Doya 96].
Although fewer convergence proofs exist for the actor/critic
algorithms (e.g. [Williams et al. 90] and [Gullapalli 92])
than for value-iteration based algorithms such
as Q-learning [Watkins et al. 92], the actor/critic algorithms
have the following practical advantages.

•  It is easy to implement multidimensional continuous
actions, which are often mixed with discrete actions
[Gullapalli 92]. Because the actor selects actions
by its stochastic policy, the action-selection
problems that arise in Q-learning do not exist.
Q-learning needs to estimate returns for all
state-action pairs, whereas the critic estimates
only the return of each state.

•  Memory-less stochastic policies can be considerably
better than memory-less deterministic
policies in the case of partially observable
Markov decision processes (POMDPs) [Singh 94]
[Jaakkola 94] or multi-player games [Littman 94].

•  It is easy to incorporate an expert's knowledge
into the learning system by applying conventional
supervised learning techniques to the actor
[Clouse et al. 92].

Eligibility traces are a fundamental mechanism
that has been widely used to handle delayed
reward [Singh 96]. The traces are also often
used to overcome non-Markovian effects [Sutton 95],
[Pendrith et al. 96]. In Barto, Sutton and Anderson's
ASE/ACE architecture, both the critic and the actor
make use of the eligibility trace. Theoretical results on
eligibility traces in the context of TD(λ) [Sutton 88]
have been obtained, but in actor/critic algorithms
the effect of the actor's trace has not been investigated.
This paper presents an analysis of an actor/critic algorithm
in which the actor improves its policy using
eligibility traces of the policy parameters. This may
be the first analysis of the actor's eligibility traces.


2  Discounted  Reward  Criteria 


At each discrete time $t$, the agent observes $x_t$ containing
information about its current state, selects action
$a_t$, and then receives an instantaneous reward $r_t$ resulting
from the state transition in the environment. In
general, the reward and the next state may be random,
but their probability distributions are assumed
to depend only on $x_t$ and $a_t$ in Markov decision processes
(MDPs), in which many reinforcement learning
algorithms are studied. The objective of reinforcement
learning is to construct a policy that maximizes the
agent's performance. A natural performance measure
for infinite horizon tasks is the cumulative discounted
reward:

$$V_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}, \tag{1}$$

where the discount factor $0 < \gamma < 1$ specifies the
importance of future rewards. $V_t$ is called the actual
return; it specifies how good the reward sequence
after time $t$ is. In this notation, the goal of learning
is to maximize the expected return. In MDPs, the
expected return can be defined for all states as:

$$V^{\pi}(x) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, x_t = x \right\}, \tag{2}$$

where $E_{\pi}$ denotes the expectation assuming the agent
always uses the stationary policy $\pi$. $V^{\pi}(x)$ is called the
value function; it specifies how good the given state
$x$ is. In MDPs, the goal of learning is to find an
optimal policy that maximizes the value of each state
$x$ defined by Equation 2. Although similar value functions
can be given in POMDPs, difficulties in defining
the optimum have been pointed out in [Singh 94].
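
For a finite reward sequence, the actual return of Equation 1 can be computed with a single backward pass. The sketch below is only an illustration: the function names are ours, and truncating the infinite sum at the end of the recorded sequence is an assumption.

```python
def all_returns(rewards, gamma):
    """Return [V_0, V_1, ...] with V_t = sum_k gamma**k * rewards[t + k].

    Uses the recursion V_t = r_t + gamma * V_{t+1}, with V taken to be 0
    beyond the end of the recorded sequence (a truncation assumption).
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def discounted_return(rewards, gamma):
    """Actual return V_0 of the whole sequence."""
    returns = all_returns(rewards, gamma)
    return returns[0] if returns else 0.0
```

For example, three unit rewards with $\gamma = 0.5$ give $V_0 = 1 + 0.5 + 0.25 = 1.75$.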


3  Actor/Critic  Algorithms 

Figures 1 and 2 give an overview of actor/critic algorithms
[Sutton 90] [Crites et al. 94]. There are many
ways to implement the policy and its updating scheme
in the actor. The algorithms for the critic are mostly
TD methods. Two points should be noted: first,
the actor implements a stochastic policy;
second, the actor improves its policy using the TD-error.
This paper investigates in particular an algorithm
for the actor.


Figure  1:  A  generic  actor/critic  framework. 


1.  The agent observes $x_t$ in the environment, and the
actor executes action $a_t$ according to the current
stochastic policy $\pi$.

2.  The critic receives the immediate reward $r_t$, and then
observes the resulting next state $x_{t+1}$. The critic provides
the TD-error as a useful reinforcement feedback to
the actor, according to

(TD-error) $= [r_t + \gamma \hat{V}(x_{t+1})] - \hat{V}(x_t)$,

where $0 < \gamma < 1$ is the discount factor and $\hat{V}(x)$ is the
value function estimated by the critic.

3.  The actor updates the stochastic policy using the TD-error.
If (TD-error) > 0, action $a_t$ performed relatively
well and its probability should be increased. If
(TD-error) < 0, action $a_t$ performed relatively poorly
and its probability should be decreased.

4.  The critic updates the estimated value function $\hat{V}(x)$
according to TD methods, e.g., the TD(0) algorithm adjusts
$\hat{V}(x_t) \leftarrow \hat{V}(x_t) + \alpha$ (TD-error), where $\alpha$ is the learning
rate.

5.  Go to step 1.


Figure 2: Main loop of the generic actor/critic algorithm.
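
Steps 2 and 4 of the loop above can be sketched for a table-based critic as follows. This is a minimal illustration, not the authors' implementation: the dictionary representation of $\hat{V}$ and the function names are our assumptions.

```python
def td_error(reward, v_current, v_next, gamma):
    """TD-error = [r_t + gamma * V(x_{t+1})] - V(x_t)."""
    return reward + gamma * v_next - v_current


def td0_update(values, x, reward, x_next, gamma, alpha):
    """Apply one TD(0) step to the table `values` and return the TD-error.

    `values` maps a (discrete) state to its estimated value; this table
    representation is an assumption made for the sketch.
    """
    delta = td_error(reward, values[x], values[x_next], gamma)
    values[x] += alpha * delta
    return delta
```

The returned TD-error is the quantity that step 3 passes to the actor as its reinforcement signal.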
4  Adding Eligibility Trace to the Actor

4.1  Function Approximation for Stochastic Policies

In this paper, $\pi(a, W, x)$ denotes the probability of selecting
action $a$ under the policy $\pi$ in the observation $x$.
The $\pi(a, W, x)$ is taken to be a probability density
function when the set of possible actions is continuous.
The policy is represented by a parametric function
approximator using the internal variable vector
$W$. The agent can improve the policy $\pi$ by modifying
$W$. For example, $W$ corresponds to synaptic weights
where the action selecting probability is represented by
neural networks, or $W$ means the weights of rules in classifier
systems. The advantage of using the notation
of the parametric function $\pi()$ is that computational
restrictions and mechanisms of the agent can be specified
simply by a form of the function, and then we
can provide a sound theory of learning algorithms for
arbitrary types of the actor.


4.2  Details  of  the  Algorithm 

Figure 3 specifies the actor/critic algorithm that uses
the eligibility trace in the actor. The ASE/ACE system
configured for pole-balancing [Barto et al. 83] is
just an instance of this algorithm. The actor's eligibility
in step 3 is the same variable defined in Williams'
REINFORCE algorithms [Williams 92]. The eligibility
$e_i(t)$ specifies a correlation between the associated
policy parameter $w_i$ and the executed action $a_t$. The
eligibility trace $D_i(t)$ is a discounted running average
of eligibility. It accumulates the agent's history. When
a positive reinforcement is given, the actor updates $W$
so that the probability of actions recorded in the history
is increased. This means the TD-error at time
$t$ affects not only the action $a_t$ but also $a_{t-1}, a_{t-2}, \ldots$.
At first glance, this idea may seem senseless for improving the
policy, but it has very interesting features described in detail
later. Note that the algorithm shown in Figure 3 is
identical to a stochastic gradient ascent for discounted
reward [Kimura et al. 97] when the actor's discount
factor $\beta = \gamma$ and the $\hat{V}(x)$ in the critic equals a constant
$b$ for all observations.

The actor requires memory to implement $W$ for the
policy and to implement $D_i$ for the eligibility trace.
The amount of memory for $D_i$ is equal to that for $W$.
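
The actor's update described above (trace $D_i(t) = e_i(t) + \beta D_i(t-1)$ followed by a step along (TD-error) $\cdot D_i(t)$) can be sketched as below. The class name and the list-based parameter storage are hypothetical choices for illustration; how the eligibility $e_i(t)$ is computed depends on the form of the policy and is left to the caller.

```python
class TraceActor:
    """Keeps the policy parameters W and one eligibility trace per parameter."""

    def __init__(self, n_params, beta, alpha_p):
        self.w = [0.0] * n_params      # policy parameters W
        self.trace = [0.0] * n_params  # eligibility traces D_i, zero before t = 0
        self.beta = beta               # discount factor of the trace
        self.alpha_p = alpha_p         # actor learning rate

    def update(self, eligibility, td_error):
        """One actor step given e_i(t) for every parameter and the TD-error."""
        for i, e in enumerate(eligibility):
            # D_i(t) = e_i(t) + beta * D_i(t - 1)
            self.trace[i] = e + self.beta * self.trace[i]
            # w_i <- w_i + alpha_p * (TD-error) * D_i(t)
            self.w[i] += self.alpha_p * td_error * self.trace[i]
```

Setting `beta = 0` recovers an actor without a trace, in which each TD-error only reinforces the most recent action.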


1.  The agent observes $x_t$, and the actor executes action
$a_t$ with probability $\pi(a_t, W, x_t)$.

2.  The critic receives the immediate reward $r_t$, and then
observes the resulting next state $x_{t+1}$. The critic provides
the TD-error to the actor according to

$$(\text{TD-error}) = [r_t + \gamma \hat{V}(x_{t+1})] - \hat{V}(x_t), \tag{3}$$

where $0 < \gamma < 1$ is the discount factor and $\hat{V}(x)$ is the
value function estimated by the critic.

3.  The actor updates the stochastic policy using the TD-error
according to:

Eligibility: $e_i(t) = \frac{\partial}{\partial w_i} \ln(\pi(a_t, W, x_t))$,

Eligibility Trace: $D_i(t) = e_i(t) + \beta D_i(t-1)$,

$\Delta w_i(t) = (\text{TD-error})\, D_i(t)$,

$W \leftarrow W + \alpha_p \Delta W(t)$,

where $w_i$ denotes the $i$-th component of $W$, $e_i$ and
$D_i$ are the associated eligibility and eligibility trace
respectively, $\beta$ ($0 \le \beta < 1$) is a discount factor for the
eligibility trace, and $\alpha_p$ is the learning rate for the actor.

4.  The critic updates the estimated value function $\hat{V}(x)$
according to TD methods, e.g., the TD(0) algorithm adjusts
$\hat{V}(x_t) \leftarrow \hat{V}(x_t) + \alpha$ (TD-error), where $\alpha$ is the learning
rate.

5.  Go to step 1.


Figure 3: The actor/critic algorithm adding the eligibility
trace to the actor.


4.3  An Analysis of the Algorithm

Assume that the actor's discount factor $\beta$ equals $\gamma$,
and that $D_i(t) = 0$ for all $t < 0$. Then the algorithm shown
in Figure 3 updates the policy parameters as:

$$\sum_{t=0}^{\infty} \Delta w_i(t) = \sum_{t=0}^{\infty} \left(r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t)\right) D_i(t)$$

$$= \sum_{t=0}^{\infty} \left(r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t)\right) \left(\sum_{\tau=0}^{t} \gamma^{t-\tau} e_i(\tau)\right)$$

$$= \sum_{t=0}^{\infty} e_i(t) \left(\sum_{\tau=t}^{\infty} \gamma^{\tau-t} \left(r_\tau + \gamma \hat{V}(x_{\tau+1}) - \hat{V}(x_\tau)\right)\right) \tag{4}$$

$$= \sum_{t=0}^{\infty} e_i(t) \left(V_t - \hat{V}(x_t)\right) \tag{5}$$

(5) 


Equation 5 is given by Equations 1 and 4. Here we assume
that the statistics of the random variable $V_t$ depend
only on the current policy parameter. This means
$E\{V_t\}$ is a deterministic function of $W$, where $E$ denotes
the expectation operator. This assumption may
be valid if the policy has converged to an equilibrium
point. The critic's estimate $\hat{V}(x_t)$ is obviously independent
of the action at time $t$. From the theory
of Williams' REINFORCE algorithm [Williams 92],
the values $V_t$ and $\hat{V}(x_t)$ in Equation 5 can be seen
as a reinforcement signal and a reinforcement baseline
respectively, so we have $E\{e_i(t)(V_t - \hat{V}(x_t))\} =
(\partial/\partial w_i) E\{V_t\}$. That is, the algorithm updates the policy
parameters statistically in a direction that increases
the actual return $V_t$, not in the direction of a gradient
of the estimated value function in the critic. It can also
be seen as an extension of reinforcement comparison
methods [Sutton et al. 98], in which $\hat{V}(x_t)$ corresponds to
the reference reward.

From the above analysis and Figure 3, we can explain
what the actor's eligibility trace does. At
time $t$, the algorithm reinforces $a_t$ using the TD-error
$r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t)$ as a temporary expedient; thereafter
the actor's eligibility trace replaces $\hat{V}(x_{t+1})$ with
the actual return $(r_{t+1} + \gamma r_{t+2} + \cdots)$ in order.
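
The replacement of the estimated value by the actual return rests on a telescoping sum: the discounted sum of TD-errors from time $t$ onward equals $V_t - \hat{V}(x_t)$. The sketch below checks this numerically on a finite episode. It assumes, for illustration only, that the episode ends after the last reward and that the value estimate of the terminal state is 0; the function names are hypothetical.

```python
def discounted_td_sum(rewards, values, gamma, t):
    """Sum over tau >= t of gamma**(tau - t) * TD-error(tau).

    `values` must hold len(rewards) + 1 estimates; values[-1] is the
    estimate for the terminal state.
    """
    total = 0.0
    for tau in range(t, len(rewards)):
        delta = rewards[tau] + gamma * values[tau + 1] - values[tau]
        total += gamma ** (tau - t) * delta
    return total


def actual_return(rewards, gamma, t):
    """V_t = sum over k >= t of gamma**(k - t) * rewards[k]."""
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
```

Whatever estimates the critic holds for the intermediate states, they cancel in the sum, which is why an inaccurate critic changes only the baseline and not the quantity being followed.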

The critic does not affect the direction of the average
update vector, because the critic works as a reinforcement
baseline. Therefore, the actor can improve its
policy whether or not the critic is able to learn the value
function. If the critic approximates the value
function well, the actor's learning is accelerated.

The above results hold under the special condition $\beta = \gamma$.
If $\beta = 0$, the actor updates $W$ in the direction
of the gradient of the approximated value function in
the critic. The $\beta$ ($0 < \beta < \gamma$) interpolates between
the above two limiting cases. The characteristics of
the $\beta$ are similar to the $\lambda$ in TD(λ) [Sutton 88] and
Q(λ)-learning [Peng et al. 94].

5  Preliminary  Experiments 

This section demonstrates the performance of the algorithm
applied to a simple linear control problem.


5.1  A  Linear  Quadratic  Regulator  (LQR) 

The following linear control problem can serve as a
benchmark of delayed reinforcement tasks [Baird 94].
At a given discrete time $t$, the state of the environment
is the real value $x_t$. The agent chooses a control
action $a_t$ that is also a real value. The dynamics of the
environment is:

$$x_{t+1} = x_t + a_t + noise, \tag{6}$$

where the noise is drawn from a normal distribution with
standard deviation $\sigma_{noise} = 0.5$. The immediate
reward is given by

$$r_t = -x_t^2 - a_t^2. \tag{7}$$

The goal is to maximize the total discounted reward,
defined by Equation 1 or 2 for all $x$. Because the task is
a linear quadratic regulator (LQR) problem, it is possible
to calculate the optimal control rule. From the
discrete-time Riccati equation, the optimum regulator
is given by

$$a_t = -k_1 x_t, \quad \text{where } k_1 = 1 - \frac{2}{1 + 2\gamma + \sqrt{4\gamma^2 + 1}}. \tag{8}$$

The optimum value function is given by $V^*(x_t) =
-k_2 x_t^2$, where $k_2$ is some positive constant. In this
experiment, the set of possible states is constrained to
lie in the range $[-4, 4]$. When the state transition given
by Equation 6 does not result in the range $[-4, 4]$, the
$x_t$ is truncated. When the agent chooses an action that
does not lie in the range $[-4, 4]$, the action executed in
the environment is also truncated.
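
The benchmark above is small enough to sketch directly. The code below computes the optimal gain $k_1$ of Equation 8 and simulates one environment step with truncation to $[-4, 4]$. It is an illustrative sketch, not the authors' simulator: the function names are ours, the noise is assumed to have zero mean, and computing the reward from the truncated action is our assumption, since the text does not say which action enters Equation 7.

```python
import math
import random


def optimal_gain(gamma):
    """k_1 = 1 - 2 / (1 + 2*gamma + sqrt(4*gamma**2 + 1)), Equation 8."""
    return 1.0 - 2.0 / (1.0 + 2.0 * gamma + math.sqrt(4.0 * gamma ** 2 + 1.0))


def clip(value, low=-4.0, high=4.0):
    """Truncate a state or action to the allowed range."""
    return max(low, min(high, value))


def lqr_step(x, action, rng, noise_std=0.5):
    """One transition: returns (next state, reward).

    Zero-mean Gaussian noise and the use of the truncated action in the
    reward are assumptions of this sketch.
    """
    a = clip(action)
    reward = -x ** 2 - a ** 2
    x_next = clip(x + a + rng.gauss(0.0, noise_std))
    return x_next, reward
```

With $\gamma = 0.9$ this gives $k_1 \approx 0.588$, so the optimal regulator cancels a little more than half of the current state at each step.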


5.2  Implementation  for  the  LQR  Problem 
5.2.1  The  Actor 


Remember that the policy $\pi(a, W, x)$ is a probability density
function when the set of possible actions is continuous.
The normal distribution is a simple multiparameter
distribution for a continuous random variable.
It has two parameters, the mean $\mu$ and the standard
deviation $\sigma$. When the policy function $\pi$ is given by
Equation 9, the eligibilities of $\mu$ and $\sigma$ are

$$\pi(a, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(a - \mu)^2}{2\sigma^2}\right), \tag{9}$$

$$e_\mu = \frac{a_t - \mu}{\sigma^2}, \tag{10}$$

$$e_\sigma = \frac{(a_t - \mu)^2 - \sigma^2}{\sigma^3}. \tag{11}$$

One useful feature of such a Gaussian unit
[Williams 92] is that the agent has the potential to control
its degree of exploratory behavior. We must draw
attention to the fact that the eligibility diverges
as $\sigma$ approaches 0, because the parameter $\sigma$
appears in the denominators of Equations 10 and 11.
The divergence of the eligibility has a bad influence on
the algorithm. One way to overcome this problem is
to control the step size of the parameter update
using $\sigma$. This is obtained by setting the learning rate
parameter proportional to $\sigma^2$; then the eligibility can
be seen as

$$e_\mu = a_t - \mu, \qquad e_\sigma = \frac{(a_t - \mu)^2 - \sigma^2}{\sigma}. \tag{12}$$

The actor first computes $\mu$ and $\sigma$ deterministically
and then draws its output from the normal distribution
with mean $\mu$ and standard
deviation $\sigma$. The actor has two internal variables,
$w_1$ and $w_2$, and computes the values of $\mu$ and $\sigma$
according to

$$\mu = w_1 x_t, \qquad \sigma = \frac{1}{1 + \exp(-w_2)}. \tag{13}$$

Then, $w_1$ can be seen as a feedback gain. The reason
for this calculation of $\sigma$ is to guarantee that $\sigma$ stays
positive. The $e_1$ and $e_2$ are the characteristic eligibilities
of $w_1$ and $w_2$ respectively. From Equation 12, $e_1$
and $e_2$ are given by

$$e_1 = e_\mu \frac{\partial \mu}{\partial w_1} = (a_t - \mu)\, x_t, \tag{14}$$

$$e_2 = e_\sigma \frac{\partial \sigma}{\partial w_2} = \left((a_t - \mu)^2 - \sigma^2\right)(1 - \sigma). \tag{15}$$

The $w_1$ is initialized to $0.35 \pm 0.15$, and $w_2 = 0$, i.e.,
$\sigma = 0.5$. The learning rate $\alpha_p$ is fixed to 0.001.
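
The Gaussian actor described above can be sketched in a few lines. The function names are hypothetical; the formulas follow Equations 9 and 13 to 15, with the eligibilities already scaled by $\sigma^2$ as in Equation 12.

```python
import math


def policy_params(w1, w2, x):
    """mu = w1 * x and sigma = 1 / (1 + exp(-w2)), Equation 13."""
    mu = w1 * x
    sigma = 1.0 / (1.0 + math.exp(-w2))
    return mu, sigma


def log_policy(w1, w2, x, a):
    """Log of the Gaussian density of Equation 9 at action a."""
    mu, sigma = policy_params(w1, w2, x)
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - (a - mu) ** 2 / (2.0 * sigma ** 2)


def eligibilities(w1, w2, x, a):
    """Scaled characteristic eligibilities e_1 and e_2, Equations 14 and 15."""
    mu, sigma = policy_params(w1, w2, x)
    e1 = (a - mu) * x
    e2 = ((a - mu) ** 2 - sigma ** 2) * (1.0 - sigma)
    return e1, e2


def sample_action(w1, w2, x, rng):
    """Draw an action from the normal distribution with mean mu and std sigma.

    `rng` is any object with a gauss(mean, std) method, e.g. random.Random.
    """
    mu, sigma = policy_params(w1, w2, x)
    return rng.gauss(mu, sigma)
```

Up to the factor $\sigma^2$, `eligibilities` returns the partial derivatives of `log_policy` with respect to $w_1$ and $w_2$, which can be confirmed by finite differences.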


5.2.2  The  Critic 

The critic quantizes the continuous state space ($-4 \le
x \le 4$) into an array of boxes. We have tried two types
of quantization: one discretizes $x$ evenly into 3
boxes, the other into 10 boxes. The critic attempts to
store in each box a prediction of the value $V$ by using
TD(0) [Sutton 88]. The learning rate $\alpha$ for TD(0) is
fixed to 0.2.
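
The even quantization into boxes can be sketched as a simple index computation. The function name and the treatment of the right endpoint (assigned to the last box) are our assumptions.

```python
def box_index(x, n_boxes, low=-4.0, high=4.0):
    """Index of the box containing x when [low, high] is split evenly.

    States outside the range are truncated first; the right endpoint is
    assigned to the last box (an assumption of this sketch).
    """
    x = max(low, min(high, x))
    index = int((x - low) * n_boxes / (high - low))
    return min(n_boxes - 1, index)
```

With 3 boxes every box covers a width of about 2.67, so states that call for quite different controls share one value estimate, which is the coarse-critic condition examined in the experiments.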


5.3  Simulation Results

Figures 4, 5, 6, 7 and 8 show the performance of 100
trials in the LQR problem with the discount rate γ =
0.9.

Figure 4 shows the performance of the algorithm in
which the critic uses 3 boxes and the actor does not use
eligibility traces, i.e., β = 0. Figure 6 shows the
performance where the critic uses 10 boxes and the actor
does not use the traces. The algorithm in Figure 6
converged close to the optimum feedback gain. In
contrast, the algorithm in Figure 4 did not. The reason
for this is that the ability of the function approximation
(3 boxes) is insufficient for learning a policy without
the trace.

Figure 5 shows the performance where the critic uses
3 boxes and the actor uses the trace, β = γ = 0.9. It
achieved much better results in terms of both the
learning efficiency and the quality of the mean value of
the converged policy than the algorithm in Figure 4 or
6. Evidently, the actor's eligibility trace is responsible
for these two advantages. The reason for the learning
efficiency in this case may be that the actor's trace
accelerates the propagation of information. The better
quality of the policy is clearly owing to the property that
the actor improves its policy by using a gradient of actual
return, shown in Section 4.3. Therefore, the algorithm
using the trace was not influenced by the critic's
ability in terms of the quality of the mean of the policy.
We can also see this property in Figure 8, but its
deviation is considerably large. Figure 9 shows the value
function that is defined by Equations 1 and 7 over the
parameter space μ and σ. The value of performance is
fairly flat around the optimal solution. This is the
reason that the deviation of the policy is large in Figure
8. This example makes it clear that the critic controls
the step-size of the actor's backups so that the step-size
is taken to be smaller around the local maximum.

The algorithm in Figure 7 achieved the best results in
terms of both the mean and the deviation of the
policy. The reason for this may be the critic's
nearly perfect value estimation.

In this preliminary experiment, we can see that the
algorithm using the actor's eligibility trace performed
better than the algorithm without the trace
under the same computational resources.

Here we presented the results of the actor-critic that
uses only TD(0) in the critic, but we have also
experimented with TD(λ) where 0 < λ < 1. Roughly speaking,
performance is poor when λ approaches
1. It follows from this that the eligibility trace in
the critic cannot make up for the critic's poor ability
of function approximation. The details of the
experiments using TD(λ) will appear in other papers.
Figure 4: The average performance of 100 trials without
the actor's eligibility trace (β = 0). The critic uses
3 boxes. (Horizontal axis: learning steps.)


Actor/Critic  Algorithms  using  Eligibility  Traces  283 


Figure 5: The average performance of 100 trials using
the actor's trace β = 0.9. The critic uses 3 boxes.
(Horizontal axis: learning steps, 0 to 5000.)

Figure 6: The average performance of 100 trials without
the actor's trace (β = 0). The critic uses 10 boxes.

Figure 7: The average performance of 100 trials using
the actor's trace β = 0.9. The critic uses 10 boxes.

Figure 8: The average performance of 100 trials. β =
0.9. The agent learns without the critic, i.e., the critic
provides V(x) = 0 for all x.
Figure 9: Value function over the parameter space in
the LQR problem, where γ = 0.9. It is fairly flat
around the optimum point; μ = −0.5884, σ = 0.


Figure  10:  The  cart-pole  problem. 


6  Applying to a Cart-Pole Problem

The behavior of this algorithm is demonstrated
through a computer simulation of a cart-pole control
task, which is a multi-dimensional nonlinear
non-quadratic problem. We modified the cart-pole
problem described in [Barto et al. 83] so that the action is
taken to be continuous.

6.1  Problem Formulation

The dynamics of the cart-pole system is modeled by

where M = 1.0 (kg) denotes the mass of the cart, m = 0.1
(kg) is the mass of the pole, 2ℓ = 1 (m) is the length of
the pole, g = 9.8 (m/sec^2) is the acceleration of
gravity, F (N) denotes the force applied to the cart's center of
mass, μ_c = 0.0005 is the coefficient of friction of the cart,
and μ_p = 0.000002 is the coefficient of friction of the pole. In
this simulation, we use a discrete-time system to
approximate these equations, where Δt = 0.02 sec. At each
discrete time step, the agent observes (x, ẋ, θ, θ̇) and
controls the force F. The agent can execute an action in an
arbitrary range, but the possible action in the cart-pole
system is constrained to lie in the range [−20, 20] (N).
When the agent chooses an action that does not lie in
that range, the action executed in the system is
truncated. The system begins with (x, ẋ, θ, θ̇) = (0, 0, 0, 0).
The system fails and receives a reward (penalty) signal
of −1 when the pole falls over ±12 degrees or the cart
runs over the bounds of its track (−2.4 < x < 2.4);
then the cart-pole system is reset to the initial state.

6.2  Details of the Agent

In this experiment, the actor adopts an implementation
similar to that shown in Equations 9 and 12. The
state space is constrained to the range (x, ẋ, θ, θ̇) =
(±2.4 m, ±2 m/sec, ±π × 12/180 rad, ±1.5 rad/sec).
The actor has five internal variables w_1 ... w_5 and
computes μ and σ according to

[Equation (16): illegible in the source]

Similarly to Equations 14 and 15, the eligibilities
e_1 ... e_5 are given by

e_1 = (a_t − μ) x_t ,  e_2 = (a_t − μ) ẋ_t ,
e_3 = (a_t − μ) θ_t ,  e_4 = (a_t − μ) θ̇_t ,
e_5 = ((a_t − μ)^2 − σ^2)(1 + 0.1 − σ) .

The critic discretizes the normalized state space evenly
into 3 × 3 × 3 × 3 = 81 boxes, and attempts to store in
each box V by using the TD(0) algorithm [Sutton 88]. The
parameters are set to γ = 0.95, α = 0.5, α_p = 0.001.
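
For readers who want to reproduce the environment of Section 6.1, the following sketch simulates one discrete time step. The equations of motion are not legible in this text, so the sketch assumes the commonly used cart-pole equations with friction from [Barto et al. 83] and a simple Euler update; only the constants, the force range, and the failure condition are taken from the text. The names `step` and `failed` are hypothetical.

```python
import math

# Constants quoted in the text; 2l = 1 m, so the half-length l is 0.5 m.
M, m, l, g = 1.0, 0.1, 0.5, 9.8
MU_C, MU_P, DT = 0.0005, 0.000002, 0.02
F_MAX = 20.0                      # executed force is truncated to [-20, 20] N
X_LIMIT = 2.4                     # track bounds (m)
THETA_LIMIT = math.radians(12.0)  # failure angle

def sign(v):
    return (v > 0) - (v < 0)

def step(state, force):
    """One Euler step of the assumed cart-pole dynamics."""
    x, x_dot, theta, theta_dot = state
    f = max(-F_MAX, min(F_MAX, force))           # truncate the chosen action
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    total = M + m
    tmp = (-f - m * l * theta_dot ** 2 * sin_t + MU_C * sign(x_dot)) / total
    theta_acc = (g * sin_t + cos_t * tmp - MU_P * theta_dot / (m * l)) / (
        l * (4.0 / 3.0 - m * cos_t ** 2 / total))
    x_acc = (f + m * l * (theta_dot ** 2 * sin_t - theta_acc * cos_t)
             - MU_C * sign(x_dot)) / total
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)

def failed(state):
    """True when the pole exceeds 12 degrees or the cart leaves the track."""
    x, _, theta, _ = state
    return abs(x) > X_LIMIT or abs(theta) > THETA_LIMIT
```

On failure the agent would receive the reward −1 and the state would be reset to (0, 0, 0, 0), as described in the text.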


6.3  Simulation  Results 

Figure 11 shows the performance of three learning
algorithms in which the policy representation is the
same. The actor/critic algorithm using the actor's
trace achieved the best results. In contrast, the algorithm
without the trace could not learn the control
policy because of the poor ability of function
approximation in the critic.


Figure  11:  The  average  performance  of  three  algo¬ 
rithms  on  100  trials.  The  critic  uses  3  x  3  x  3  x  3 
boxes.  A  trial  means  an  attempt  from  initial  state  to 
a  failure. 

7  Discussion 

Representation of Policies: First of all,
actor/critic algorithms should have sufficient ability to
approximate policies. If this is satisfied, use of the
actor's eligibility trace (β = γ) enables the agent to learn an
acceptable policy at less cost than increasing
the critic's ability of function approximation in our
test cases. The reason is that the policy function
representation would require less memory than the
representation of the state-action value function in many
cases.

Controlling Step-Size of Backups: It is
analytically shown in Section 4.3 that the critic
provides an appropriate reinforcement baseline to the
actor. The adaptive baseline controls the step-size of the
actor's backups so that the step-size is taken to be
smaller around the local maximum. This property
would contribute to the better learning efficiency and
the suppression of harmful drift of the policy that are
shown in the experiments.


To Overcome non-Markovian Effects: There
are many ways to implement the critic's learning
scheme. [Peng et al. 94] and [Sutton 95] pointed out
that increasing λ makes TD(λ) less sensitive to
non-Markovian effects. The actor's eligibility traces are
also useful in getting over non-Markovian problems
[Kimura et al. 97]. Therefore, the combination of
TD(λ) and the actor's eligibility trace will be more robust
in non-Markovian problems.

Combining with Efficient DP-based
Methods: If the hidden state is relatively small
in the state space, the agent may perform well when
efficient DP-based algorithms are adopted for
the critic. The DP-based algorithms accelerate the
actor's learning in completely observable states, and the
actor's stochastic policy and its trace (β = γ) would
make up for the non-Markovian effects owing to the
hidden state or function approximation.

8  Conclusions 

This  paper  presented  an  analysis  of  actor/critic  algo¬ 
rithms  in  which  the  actor  updates  its  policy  using  the 
eligibility  trace  of  the  policy  parameters.  The  results 
show  that  when  the  discount  rate  of  the  value  function 
equals  the  discount  factor  of  the  actor’s  trace,  the  actor 
improves  its  policy  by  using  a  gradient  of  actual  return, 
not  by  using  a  gradient  of  the  estimated  return  in  the 
critic.  Then,  the  critic  provides  an  adaptive  reinforce¬ 
ment  baseline  to  the  actor  controlling  the  step-size  of 
the actor's backups. It enables the agent to learn a
fairly good policy under the condition that the
approximated value function in the critic is hopelessly
imperfect. The behavior is demonstrated through
simulations showing that the trace contributes to the learning
efficiency and the suppression of undesirable drifts of
the policy. Analysis of the algorithm in non-Markovian
environments is future work.
Abstract

To  monitor  or  control  a  stochastic  dynamic  system, 
we  need  to  reason  about  its  current  state.  Exact 
inference  for  this  task  requires  that  we  maintain  a 
complete  joint  probability  distribution  over  the  pos¬ 
sible  states,  an  impossible  requirement  for  most  pro¬ 
cesses.  Stochastic  simulation  algorithms  provide  an 
alternative  solution  by  approximating  the  distribu¬ 
tion  at  time  t  via  a  (relatively  small)  set  of  samples. 

The time t samples are used as the basis for generating
the samples at time t + 1. However, since only
existing  samples  are  used  as  the  basis  for  the  next 
sampling  phase,  new  parts  of  the  space  are  never  ex¬ 
plored.  We  propose  an  approach  whereby  we  try  to 
generalize  from  the  time  t  samples  to  unsampled  re¬ 
gions  of  the  state  space.  Thus,  these  samples  are 
used  as  data  for  learning  a  distribution  over  the  states 
at  time  t,  which  is  then  used  to  generate  the  time  t+1 
samples.  We  examine  different  representations  for  a 
distribution, including density trees, Bayesian networks,
and tree-structured Bayesian networks, and
evaluate  their  appropriateness  to  the  task.  The  ma¬ 
chine  learning  perspective  allows  us  to  examine  is¬ 
sues  such  as  the  tradeoffs  of  using  more  complex 
models,  and  to  utilize  important  techniques  such  as 
regularization  and  priors.  We  validate  the  perfor¬ 
mance  of  our  algorithm  on  both  artificial  and  real 
domains,  and  show  significant  improvement  in  ac¬ 
curacy  over  the  existing  approach. 

1  Introduction 

In many real-world domains, we are interested in monitoring
the evolution of a complex situation over time. For
example, we may be monitoring a patient's vital signs in
an ICU, analyzing a complex freeway traffic scene with the
goal of controlling a moving vehicle, or even tracking motion
of objects in a visual scene. Such systems have complex
and unpredictable dynamics; thus, they are often modeled
as stochastic dynamic systems. Even when a model of
the system is known, reasoning about the system is a
computationally difficult task. Our main concern in this paper
is in using machine learning techniques as part of a reasoning
task; specifically, the task of monitoring the state of the
system as it evolves and as new observations are obtained.

Theoretically,  the  monitoring  task  is  straightforward. 
We  simply  maintain  a  probability  distribution  over  the  pos¬ 
sible  states  at  the  current  time.  As  time  evolves,  we  update 
this  distribution  using  the  transition  model;  as  new  observa¬ 
tions  are  obtained,  we  use  Bayesian  conditioning  to  update 
it.  Such  a  distribution  is  called  a  belief  state;  in  a  Marko¬ 
vian  process,  it  provides  a  concise  summary  of  all  of  our 
past  observations,  and  suffices  both  for  predicting  the  fu¬ 
ture  trajectory  of  the  system  as  well  as  for  making  optimal 
decisions  about  our  actions  [Ast65]. 

Unfortunately,  even  systems  whose  evolution  model  is 
compactly  represented  rarely  admit  a  compact  representa¬ 
tion  of  the  belief  state  and  an  effective  update  process.  Con¬ 
sider,  for  example,  a  stochastic  system  represented  as  a  dy¬ 
namic  Bayesian  network  (DBN)  [DK89].  A  DBN  partitions 
the  evolution  of  the  process  into  time  slices,  each  of  which 
represents  a  snapshot  of  the  state  of  the  system  at  one  point 
in  time.  Like  a  Bayesian  network  (BN),  the  DBN  utilizes 
a  decomposed  representation  of  the  state  via  state  variables 
and a graphical notation to represent the direct dependencies
between the variables in the model. The evolution model of
the system (the distribution over states at time t + 1 given
the state at time t) is represented in a network fragment
such  as  the  one  in  Figure  1(a)  (appropriately  annotated  with 
probabilities).  DBNs  have  been  used  for  a  variety  of  appli¬ 
cations,  including  freeway  surveillance  [FHKR95],  moni¬ 
toring  complex  factories  [JKOP89],  and  more. 

Exact inference algorithms for BNs have analogues for
inference in DBNs [Kja92]. Unfortunately, in most cases,
these algorithms also end up maintaining a belief state, that is, a
distribution over most or all of the variables in a time slice.
Furthermore, it can be shown [BK98] that the belief state
rarely has any structure that may support a compact
representation. Thus, exact inference algorithms are forced to
maintain a fully explicit joint distribution over an
exponentially large state space, making them impractical for most
complex systems.

Figure 1: (a) The simple CAPITAL 2TBN for tracking the growth of a hi-tech company; (b) The same 2TBN unrolled for 3 time slices;
(c) The WATER 2TBN.

A  similar  problem  arises  when  we  attempt  to  monitor  a 
process  with  complex  continuous  dynamics.  Here  also,  an 
explicit  representation  of  the  belief  state  is  infeasible. 

This  limitation  has  led  to  work  on  approximate  infer¬ 
ence  algorithms  for  complex  stochastic  processes  [GJ96, 
BK98,  KKR95,  IB96].  Of  the  approaches  proposed, 
stochastic  simulation  algorithms  are  conceptually  sim¬ 
plest  and  make  the  fewest  assumptions  about  the  struc¬ 
ture  of  the  process.  The  survival  of  the  fittest  (SOF)  al¬ 
gorithm  [KKR95]  has  been  applied  with  success  to  large 
discrete  DBNs  [FHKR95].  The  same  algorithm  (indepen¬ 
dently  discovered  by  [IB96])  has  been  applied  to  the  con¬ 
tinuous  problem  of  tracking  object  motion  in  cluttered  vi¬ 
sual  scenes. 

The  algorithm,  which  builds  on  stochastic  simulation 
algorithms  for  standard  BNs  [SP89],  is  as  follows:  For 
each  time  slice,  we  maintain  a  (small)  set  of  weighted  sam¬ 
ples;  a  sample  is  one  possible  state  of  the  system  at  that 
time,  while  its  weight  is  some  measure  of  how  likely  it  is. 
This  set  of  weighted  samples  is,  in  effect,  a  very  sparse 
estimate  of  the  belief  state  at  time  t.  A  sample  at  time 
t is propagated to time t + 1 by a random process based
on  the  dynamics  of  the  system.  In  a  naive  generalization 
of  [SP89],  each  time  t  sample  is  propagated  forward  to  time 
t  +  1.  However,  as  shown  by  [KKR95],  this  approach  re¬ 
sults  in  extremely  poor  performance,  with  the  error  of  the 
approximation  diverging  rapidly  as  t  grows.  They  propose 
an  approach  where  samples  are  propagated  preferentially: 
those  whose  weight  is  higher  are  more  likely  to  be  propa¬ 
gated,  while  the  lower  weight  ones  tend  to  be  “killed  off.” 
Technically,  samples  from  time  t  are  selected  for  propa¬ 
gation  using  a  random  process  that  chooses  each  sample 
proportionately  to  its  weight.  The  resulting  trajectories  are 
weighted  based  on  how  well  they  fit  the  new  evidence  at 
time  t  +  1,  and  the  process  continues.  Despite  its  simplicity 
and  low  computational  cost,  the  SOF  algorithm  performs 
very  well;  as  shown  in  [KKR95],  its  error  seems  to  remain 
bounded  indefinitely  over  time.  As  shown  in  [IB96],  this 
algorithm  can  also  deal  with  complex  continuous  processes 


much  more  successfully  than  standard  techniques. 

In  this  paper,  we  use  machine  learning  techniques  to  im¬ 
prove  the  behavior  of  the  SOF  algorithm,  with  the  goal  of 
applying  it  to  real-world  complex  domains.  The  SOF  algo¬ 
rithm  shifts  its  effort  from  less  likely  to  more  likely  trajec¬ 
tories,  thereby  focusing  on  the  more  relevant  parts  of  the 
space. However, at time t + 1, it only samples parts of the
space  that  arise  from  samples  that  it  had  at  time  t.  Thus, 
it  does  not  allow  for  correcting  earlier  mistakes:  if  its  sam¬ 
ples  at  time  t  were  unrepresentative  in  some  way,  then  its 
samples at time t + 1 are also likely to be so. We can reinterpret
this behavior from a somewhat different perspective.
The set of time t samples is an approximation to the belief
state  at  time  t.  When  SOF  chooses  which  samples  to  prop¬ 
agate  it  is  simply  sampling  from  this  approximate  belief 
state.  Thus,  the  SOF  algorithm  is  using  a  set  of  weighted 
samples  as  an  approximation  to  a  belief  state,  and  a  random 
sampling  process  to  propagate  an  approximate  belief  state 
at  time  t  to  one  at  time  t+1. 

This  perspective  is  a  natural  starting  point  for  our  ap¬ 
proach.  Clearly,  a  small  number  of  weighted  points  is  a 
suboptimal  way  of  representing  a  complex  distribution  over 
a  large  space:  As  the  number  of  samples  is  much  smaller 
than  the  total  size  of  the  space,  the  representation  is  very 
sparse,  and  therefore  necessarily  unrepresentative.  Our  key 
insight  is  that  our  information  about  the  relative  likelihood 
of  even  a  small  number  of  points  in  the  space  can  tell  us 
a  lot  about  the  relative  likelihood  of  others.  Thus,  we  can 
treat our samples as data cases and use them to learn the
shape  of  the  distribution.  In  other  words,  we  can  use  our 
own  randomly  generated  samples  as  input  to  a  density  esti¬ 
mation  algorithm,  and  use  them  to  learn  the  distribution. 

This  insight  leads  us  to  explore  a  number  of  improve¬ 
ments  to  the  SOF  algorithm.  We  first  note,  in  Section  3, 
that  the  number  of  samples  needed  to  adequately  estimate 
the  distribution  can  vary  widely:  in  situations  where  the 
evidence  is  unlikely,  more  samples  will  be  needed  in  order 
to  “find”  relevant  regions  of  the  space.  Luckily,  as  we  are 
generating  our  own  data,  we  can  generate  as  many  samples 
as  we  need;  that  is,  we  can  perform  a  simple  type  of  active 
learning  [CAL94]. 
We then introduce a Dirichlet prior over the parameters
of  our  distribution  in  order  to  deal  with  the  problem  of  nu¬ 
merical  overfitting,  a  particularly  serious  problem  when  we 
have  a  sparse  sample  for  a  very  large  space.  We  show  that 
even  these  two  simple  improvements  serve  to  significantly 
increase  the  accuracy  of  our  algorithm. 

We  then  proceed  to  investigate  the  issue  of  generalizing 
from  the  samples  to  other  parts  of  the  space.  Of  course,  in 
order  to  generalize,  we  need  a  representation  whose  bias  is 
higher.  The  requirements  of  our  task  impose  several  con¬ 
straints  both  on  the  representation  of  the  distribution  and 
on  the  algorithm  used  to  estimate  it.  First,  as  our  state 
space  is  exponentially  large,  we  must  restrict  attention  to 
compact  representations  of  distributions.  Second,  we  must 
allow  samples  to  be  generated  randomly  from  the  distribu¬ 
tion  in  a  very  efficient  way.  Thus,  for  example,  a  neural 
network  whose  input  is  a  possible  state  of  the  process  and 
whose  output  is  the  probability  of  that  state  would  not  be 
appropriate.  Finally,  as  we  are  primarily  interested  in  fast 
monitoring  in  time-critical  applications,  we  prefer  density 
estimation  algorithms  that  are  less  compute-intensive. 

Based  on  these  constraints,  we  explore  three  main  ap¬ 
proaches,  appropriate  to  processes  represented  as  DBNs: 
Bayesian  networks  with  a  fixed  structure,  tree-structured 
Bayesian  networks  with  a  variable  structure,  and  density 
trees,  which  resemble  decision  trees  or  discrete  regression 
trees.  We  compare  the  performance  of  these  algorithms  to 
that  of  the  SOF  algorithm,  and  show  that  all  three  track  the 
process  with  higher  accuracy.  We  show  that  the  density 
tree  approach  seems  particularly  promising,  and  suggest  a 
possible  explanation  as  to  why  it  behaves  better  than  the 
other  approaches.  We  conclude  with  some  discussion  and 
possible  extensions  of  our  approach  to  other  domains. 

2  Preliminaries 

A discrete time stochastic process is viewed as evolving
randomly from one state to another at discrete time points.
Formally, there is a set of states S such that at any point in
time t, the situation can be described using some state x ∈
S. We typically assume that the process is Markovian so
that the probability of being in state x' at time t + 1 depends
only on the state of the world at time t. Formally, letting
X^{(t)} denote the random variable (or set of random variables)
representing the state at time t, we have that X^{(t+1)}
is independent of X^{(0)}, ..., X^{(t-1)} given X^{(t)}.
We also typically assume that the process is time invariant,
so that P(X^{(t+1)} | X^{(t)}) does not depend on t. Thus,
it can be specified using a single transition model which
holds for all time points.

In a DBN, the state of the process at time t is specified in
terms of a set of state variables X_1^{(t)}, ..., X_n^{(t)}. The
transition model therefore has to define a probability distribution
P(X_1^{(t+1)}, ..., X_n^{(t+1)} | X_1^{(t)}, ..., X_n^{(t)}). We specify such
a distribution using a network fragment called a 2TBN (a
two time-slice Bayesian network), as shown in Figure 1(a).
A 2TBN defines the probability distribution for any time
slice t + 1 given time slice t: for each variable X_i^{(t+1)}
in the second time slice, the network fragment specifies a
set of parents Parents(X_i^{(t+1)}), which can be variables
either in time slice t + 1 or in time slice t; it also specifies
a conditional probability table, which describes the
probability distribution over the values of X_i^{(t+1)} given any
possible combination of values for its parents. As a whole, the
2TBN completely specifies P(X^{(t+1)} | X^{(t)}). The
network fragment can be unrolled to define a distribution over
arbitrarily many time slices. Figure 1(b) shows the 2TBN
of Figure 1(a) unrolled over three time slices.

The state of the process is almost never fully observable;
thus, in any time slice, we will get to observe the
values of only some subset of the variables. In most
monitoring tasks, the set of observable variables is the same in
the different time slices. These variables typically represent
sensor readings, e.g., the reading of some blood-pressure
monitor in the ICU or the output of a video camera on a
freeway overpass. Let O^{(t)} be the set of observable
variables at time t, and let o^{(t)} be the instantiation of values for
these variables observed at time t. In the monitoring task,
we are interested in reasoning about X_1^{(t)}, ..., X_n^{(t)} given
all the observations seen so far; i.e., we want to maintain
P(X^{(t)} | o^{(1)}, ..., o^{(t)}).

As we discussed in the introduction, the stochastic
simulation algorithms for DBNs are based on the standard
likelihood weighting (LW) algorithm. The algorithm, shown
in Figure 2, generates a sample by starting at the roots of
the network and continuing in a top-down fashion, picking
a value for every variable in turn. A value for a variable
is sampled according to the appropriate conditional
distribution, given the values already selected for its parents.
Variables whose values were observed as evidence are not
sampled; rather, the variable is simply instantiated to its
observed value. However, we must compensate for the fact
that we forced this variable to take a value which may or
may not be likely. Thus, we modify the weight of the
sample to reflect the likelihood of having observed this
particular value for this variable. It is easy to see that, while our
algorithm only generates points x that are consistent with
our observations, the expected weight for any such x (i.e.,
the probability with which it is generated times its weight
when it is) is exactly its probability. Thus, our weighted
samples are an unbiased estimator of the (unnormalized)
distribution over the states and the observations. Note that
the weight of the sample represents how well it explains
our observations. Thus, a sample of very low weight is a
very bad explanation for the observations, and contributes
very little to our understanding of the situation.

The  straightforward  application  of  LW  to  DBNs  is  sim¬ 
ply  by  treating  the  DBN  as  one  very  long  BN.  Roughly 
LikelihoodWeighting(x^{(t)}, o^{(t+1)})
    w := 1
    for i := 1 to n
        Let u be the assignment to Parents(X_i^{(t+1)}) in (x^{(t)}, x_1^{(t+1)}, ..., x_{i-1}^{(t+1)})
        If X_i^{(t+1)} is not in O^{(t+1)}
            Sample x_i^{(t+1)} from P(X_i^{(t+1)} | Parents(X_i^{(t+1)}) = u)
        Else
            Set x_i^{(t+1)} to be X_i's observed value in o^{(t+1)}
            Set w := w · P(X_i^{(t+1)} = x_i^{(t+1)} | Parents(X_i^{(t+1)}) = u)
    Return (x^{(t+1)}, w)

Figure 2: A temporal version of the likelihood weighting
algorithm; it generates an instantiation x^{(t+1)} for the time t + 1
variables given an instantiation x^{(t)} for the time t variables.
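
The procedure of Figure 2 can be sketched in Python as follows. The data layout is an assumption made for illustration: the 2TBN is a list of `(name, parents, cpd)` triples in topological order, each parent is tagged `"prev"` (time t) or `"next"` (time t + 1), and `cpd` maps a tuple of parent values to a dictionary of value probabilities. `TOY_MODEL` is a hypothetical two-variable network, not one of the networks used in the paper.

```python
import random

def likelihood_weighting(prev, evidence, model, rng):
    """Propagate one time-t sample to time t+1, returning (sample, weight)."""
    nxt, weight = {}, 1.0
    for name, parents, cpd in model:
        # Assignment u to the parents, drawn from time t or from values set so far.
        u = tuple(prev[p] if where == "prev" else nxt[p] for where, p in parents)
        dist = cpd(u)
        if name in evidence:
            # Observed variable: clamp it and multiply in its likelihood.
            nxt[name] = evidence[name]
            weight *= dist.get(nxt[name], 0.0)
        else:
            # Unobserved variable: sample from its conditional distribution.
            values = list(dist)
            nxt[name] = rng.choices(values, weights=[dist[v] for v in values])[0]
    return nxt, weight

# Hypothetical 2TBN: hidden variable H (depends on previous H), observed O.
TOY_MODEL = [
    ("H", [("prev", "H")],
     lambda u: {1: 0.9, 0: 0.1} if u[0] == 1 else {1: 0.2, 0: 0.8}),
    ("O", [("next", "H")],
     lambda u: {1: 0.8, 0: 0.2} if u[0] == 1 else {1: 0.1, 0: 0.9}),
]
```

For example, `likelihood_weighting({"H": 1}, {"O": 1}, TOY_MODEL, random.Random(0))` returns a sample whose `O` is clamped to 1 and whose weight is the likelihood of that observation given the sampled `H`.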


speaking, the algorithm would maintain a set of samples
x^{(t)}[1], ..., x^{(t)}[N] for every time slice t, representing
possible states of the process at that time. At each time
slice  t,  each  of  the  samples  is  propagated  to  the  next  time 
slice  using  the  LW  algorithm,  and  its  weight  is  adjusted 
according  to  how  well  it  reflects  the  new  observations.  Un¬ 
fortunately,  as  observed  by  [KKR95],  this  approach  works 
very  poorly  for  most  DBNs.  Intuitively,  the  process  by 
which  samples  are  randomly  generated  is  oblivious  to  the 
evidence,  which  only  affects  the  weight  assigned  to  the 
samples.  Therefore,  the  samples  represent  random  trajec¬ 
tories  through  the  system,  most  of  which  are  completely 
irrelevant.  As  a  consequence,  as  shown  in  [KKR95],  the 
accuracy  of  LW  diverges  extremely  quickly  over  time. 

The survival of the fittest algorithm of [KKR95]
addresses this problem by preferentially selecting which
samples to propagate according to how likely they are, i.e., their
weight relative to other samples. Technically, each sample
x^{(t)}[j] is associated with a weight w^{(t)}[j]. In order to
propagate to the next time slice, the algorithm first
renormalizes all of the weights w^{(t)}[j] to sum to 1. Then, it
generates N new samples for time t + 1 as follows: For
each new sample j, it selects x^{(t)}[k] randomly from among
x^{(t)}[1], ..., x^{(t)}[N] according to their weight. It then calls
LW with x^{(t)}[k] as a starting point, and gets back a time t + 1
sample x^{(t+1)} and a weight w. It sets x^{(t+1)}[j] := x^{(t+1)}
and w^{(t+1)}[j] := w. Note that the weight of the sample
x^{(t)}[k] manifests in the relative proportion with which
x^{(t)}[k] will be propagated, so that we do not need to
recount it when defining w^{(t+1)}[j]. Kanazawa et al. show
empirically that, unlike LW, the error of SOF seems to
remain bounded indefinitely over time.
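
One SOF propagation step, as described above, can be sketched as follows. To keep the sketch self-contained, the LW call is abstracted as a caller-supplied function `propagate(sample, rng)` returning a new sample and its weight; that interface and the name `sof_step` are assumptions made for illustration.

```python
import random

def sof_step(samples, weights, propagate, n, rng):
    """One survival-of-the-fittest step from time t to time t+1.

    samples, weights: the weighted time-t samples.
    propagate(sample, rng) -> (new_sample, new_weight): stands in for LW.
    n: number of new samples to generate.
    """
    total = sum(weights)
    probs = [w / total for w in weights]     # renormalize weights to sum to 1
    new_samples, new_weights = [], []
    for _ in range(n):
        # Select a parent in proportion to its weight, then propagate it.
        parent = rng.choices(samples, weights=probs)[0]
        x, w = propagate(parent, rng)
        new_samples.append(x)
        new_weights.append(w)                # old weight is not re-counted
    return new_samples, new_weights
```

Because selection is already proportional to the time-t weight, the new weight reflects only the fit to the new evidence, which matches the remark in the text about not re-counting the old weight.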

3  Belief  state  estimation 

We can interpret the SOF algorithm as estimating a
probability distribution over the states at time t. Having
generated some number of samples x^{(t)}[1], ..., x^{(t)}[N], it
renormalizes their weights to sum to 1. The result is a simple
count distribution \hat{P}^{(t)} over the states at time t, one which
gives some positive probability to states that correspond to
one or more samples, and zero probability to all the rest.
The SOF algorithm then generates N new samples from
\hat{P}^{(t)} and propagates each of them to time t + 1 using the
LW algorithm. The resulting samples are again renormalized,
and the process repeats.

The distribution \hat{P}^{(t)} is a compact approximation to
the belief state at time t, that is, the correct distribution
P(X^{(t)} | o^{(1)}, ..., o^{(t)}). Assuming we know the initial
state at time 0, \hat{P}^{(0)} is precisely the belief state at time 0.
The properties of LW imply that our weighted samples at
time 1 are an unbiased estimator for the unnormalized
distribution over the time 1 states and observations.
Thus, after renormalization, \hat{P}^{(1)} is an estimator (albeit a
biased one) for P(X^{(1)} | o^{(1)}). By similar reasoning,
we have that \hat{P}^{(t)} is a (biased) estimator for
P(X^{(t)} | o^{(1)}, ..., o^{(t)}). However, each \hat{P}^{(t)} is only a very sparse
approximation to the belief state, and thus one which is less than
representative. It is also highly variable, with a strong
dependence on which samples we happened to pick at the
multiple previous random sampling stages. Both the sparsity
and variability of our estimate propagate to the next time
slice, increasing the variance of our approximation.

Our first attempt to control this variance relates to the
amount of data on which our estimation is based. Naively,
it seems that, at each phase, we are basing our estimation
procedure on the same number N of samples. However,
when we are renormalizing our distribution, we are not
dividing by N, but rather by the total weight of the samples.
Intuitively, if the evidence observed at a given time point is
unlikely, each sample generated will explain it less well, so
that its weight will be low. Thus, if the total weight of our
N samples is low, then we have not really sampled a
significant portion of the probability mass. Indeed, as argued by
Dagum and Luby [DL97], the actual number of effective
samples is their total weight. Thus, we modify the
algorithm to guarantee that our estimation is based on a fixed
weight rather than a fixed number of samples.
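
A minimal sketch of the fixed-weight idea follows. The callable `draw_weighted_sample` is a hypothetical stand-in for generating one likelihood-weighted sample, and the `max_samples` safety cap is an assumption added here so the loop always terminates; neither comes from the paper.

```python
import random


def sample_to_target_weight(draw_weighted_sample, target_weight, rng, max_samples=100000):
    """Keep drawing weighted samples until their total weight reaches
    target_weight (or the safety cap max_samples is hit)."""
    samples = []
    total = 0.0
    while total < target_weight and len(samples) < max_samples:
        state, weight = draw_weighted_sample(rng)
        samples.append((state, weight))
        total += weight
    return samples, total


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical sources: likely evidence gives each sample a high weight,
    # unlikely evidence gives each sample a low weight.
    likely = lambda r: ("s", 0.5)
    unlikely = lambda r: ("s", 0.125)
    print(len(sample_to_target_weight(likely, 4.0, rng)[0]))    # fewer samples needed
    print(len(sample_to_target_weight(unlikely, 4.0, rng)[0]))  # more samples needed
```

The number of samples drawn now adapts to how well the samples explain the evidence, while the total weight on which the estimate is based stays roughly constant.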

Our results for this improvement, applied to the simple
CAPITAL network, are shown in Figure 3. The data were
generated over 25 runs. In each run, the observations were
generated randomly, from the correct distribution; thus they
correspond to typical runs of the algorithm over a typical
evidence sequence. Figure 3(a) shows the number of
samples used over different time slices; we see that the number
of samples varies significantly over time, illustrating that
the algorithm is taking advantage of the additional
flexibility. The average number of samples per time slice used
over the run is 65. Figure 3(b) compares the accuracy of
this algorithm to that of a fixed-samples algorithm using 65
samples in each time slice. We see that while the average
number of samples used is the same, the variable-samples
approach obtains consistently higher accuracy; in order to
obtain comparable accuracy from the fixed-samples
algorithm, around 70 samples are needed. We note that while
both the error and the number of samples vary widely,
they remain bounded indefinitely over time. This
boundedness property continues to hold even in long runs with
thousands of time slices. We also note that the number of
states in the explicit belief state representation is 256, as
compared to 55-80 samples used; thus our sampling
approach allows considerable savings.

Figure 3: Comparison of variable-samples and fixed-samples algorithms for the CAPITAL network, averaged over 25 sequences: (a)
number of samples used; (b) $L_1$ error.

We  have  experimented  with  the  number  of  samples  re¬ 
quired  for  different  evidence  sequences.  Our  results  show 
that  unlikely  evidence  sequences  require  many  more  sam¬ 
ples  than  likely  evidence,  thereby  justifying  our  intuition 
about  the  reason  for  the  variability  in  the  number  of  sam¬ 
ples  needed.  Furthermore,  the  accuracy  maintained  by  the 
variable-samples  algorithm  for  likely  and  unlikely  runs  is 
essentially  the  same;  thus,  in  a  way,  the  algorithm  gener¬ 
ates  as  many  samples  as  it  needs  to  maintain  a  certain  level 
of  performance.  We  can  view  this  ability  as  a  type  of  ac¬ 
tive  learning  [CAL94],  where  the  learning  algorithm  has 
the  ability  to  ask  for  more  data  cases  when  necessary.  In 
our  context,  the  active  learning  paradigm  is  particularly  ap¬ 
propriate,  as  the  algorithm  is  generating  its  own  data  cases. 

Our next improvement relates to another problem with
the SOF algorithm. Our time t samples are necessarily very
sparse, so that many entries in the probability distribution
$\sigma_{sc}^{(t)}$ will have zero probability, even though their true
probability is positive. This type of behavior can cause
significant problems, as samples at time t + 1 are only
generated based on our existing samples at time t. If the
process is not very stochastic, i.e., if there are parts of the
state space that only transition to other parts with very low
probability, parts of the space that are not represented in
$\sigma_{sc}^{(t)}$ will not be explored. Unfortunately, the parts of the
space that are not represented may be quite likely; our
sampling process may simply have missed them earlier, or they
may be the results of trajectories that appeared unlikely in
earlier time slices because of misleading evidence. This
problem is reflected clearly if we measure the distance
between our approximation and the exact distribution using
relative entropy [CT91], for many reasons the most
appropriate distance measure for this type of situation. For
an exact distribution $\phi$ and an approximate one $\psi$ over
the same space $U$, the relative entropy $D(\phi \| \psi)$ is defined
as $\sum_{u \in U} \phi(u) \log \frac{\phi(u)}{\psi(u)}$. In cases, such as ours,
where the approximate distribution ascribes probability 0
to entries that are not impossible, the relative entropy
distance is infinite.
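
The effect of unwarranted zeros on relative entropy can be checked with a short sketch. The dictionary-based representation and the function name `relative_entropy` are assumptions made for illustration; states missing from the approximate distribution are treated as having probability 0.

```python
import math


def relative_entropy(phi, psi):
    """D(phi || psi) for discrete distributions given as dicts state -> probability.
    Returns math.inf if psi gives probability 0 to a state that phi considers possible."""
    total = 0.0
    for state, p in phi.items():
        if p == 0.0:
            continue  # terms with phi(u) = 0 contribute nothing
        q = psi.get(state, 0.0)
        if q == 0.0:
            return math.inf
        total += p * math.log(p / q)
    return total


if __name__ == "__main__":
    exact = {"a": 0.5, "b": 0.5}
    print(relative_entropy(exact, {"a": 0.25, "b": 0.75}))  # finite
    print(relative_entropy(exact, {"a": 1.0}))              # infinite: "b" was never sampled
```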

The machine learning perspective offers us a simple and
theoretically well-founded solution to the problem of
unwarranted zeros in our estimated distribution. We view the
problem from the perspective of Bayesian learning, where
we have a prior distribution over the parameters we are
trying to estimate: the probabilities $\theta_x$ of the different states
x in our belief state. An appropriate prior for multinomial
distributions such as this is the Dirichlet distribution. We
omit the formal definition of the Dirichlet prior, referring
the reader to [Deg86]. Intuitively, it is defined using a set of
hyperparameters $\alpha_{x_i}$, each representing “imaginary”
samples observed for the state $x_i$. In our case, as we have no
beliefs in favor of one state x over another, we chose $\alpha_x$
to be uniformly $\alpha/r$, where r is the total number of states
consistent with our evidence.

Computing with this seemingly complex two-level
distribution is actually quite simple, as most computations are
equivalent to working with a single distribution $\sigma_{sc+}^{(t)}$
obtained from taking the expectation of the parameters $\{\theta_x\}$
relative to their prior distribution. In our case, for each
x consistent with our evidence, we have that
$\sigma_{sc+}^{(t)}(x) = (w^{(t)}(x) + \alpha/r)/Z$, where $w^{(t)}(x)$ is the total weight of
samples whose value is x, and Z is a normalizing
factor. We see that each instantiation x in our
distribution $\sigma_{sc+}^{(t)}$ (if consistent with our evidence) will have at
least some very small probability mass. We note that, even
though $\sigma_{sc+}^{(t)}(x) > 0$ for every x, we need only represent
explicitly those states which have materialized in our
sampling algorithm. Thus, the cost of maintaining such a
distribution is no higher than that of maintaining our original
sparse set of samples.
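
The smoothed distribution and its sparse representation can be sketched as follows. The class name `SmoothedCounts` and the callable `draw_uniform_state` (which is assumed to return a uniformly random state consistent with the evidence) are hypothetical. The normalizer uses the fact that summing (w(x) + alpha/r) over all r consistent states gives the total sample weight plus alpha.

```python
import random


class SmoothedCounts:
    """Sparse representation of sigma_sc+(x) = (w(x) + alpha / r) / Z.
    Only states that were actually sampled are stored explicitly."""

    def __init__(self, weights, alpha, r):
        self.weights = dict(weights)  # total weight per sampled state
        self.alpha = alpha            # total "imaginary" weight of the prior
        self.r = r                    # number of states consistent with the evidence
        self.z = sum(self.weights.values()) + alpha  # normalizing factor Z

    def prob(self, x):
        # Assumes x is consistent with the evidence; unsampled states get alpha / (r * Z).
        return (self.weights.get(x, 0.0) + self.alpha / self.r) / self.z

    def sample(self, rng, draw_uniform_state):
        # Mixture view: with probability W / Z draw a stored state in proportion
        # to its weight, otherwise draw uniformly over all consistent states.
        total = sum(self.weights.values())
        if rng.random() < total / self.z:
            states = list(self.weights)
            return rng.choices(states, weights=[self.weights[s] for s in states], k=1)[0]
        return draw_uniform_state(rng)


if __name__ == "__main__":
    dist = SmoothedCounts({"a": 3.0, "b": 1.0}, alpha=1.0, r=4)
    print(dist.prob("a"), dist.prob("b"), dist.prob("c"))
    rng = random.Random(1)
    print(dist.sample(rng, lambda r: r.choice(["a", "b", "c", "d"])))
```

Storage stays proportional to the number of sampled states, since the prior contributes the same constant to every state and never needs to be enumerated.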

The introduction of a prior serves to “spread out” some
of the probability mass over unobserved states, increasing
the amount of exploration done for unfamiliar regions of
the space. We investigated the tradeoff between sampling
in regions that are known to be likely and in new regions.
As $\alpha$ grows, the performance of our algorithm first
improves, then gradually decreases, as we would expect.

4  Alternative  belief  states  representations 

While this approach allows us to generate samples from
unexplored parts of the space, it does so blindly: all
unsampled states are treated in exactly the same way. However,
our state space is not a completely arbitrary set of
points. Two states x and x' which give the same values
to almost all of the variables in our domain may be quite
similar, and it may make sense to assume that their
probabilities are much closer than those of other pairs. That is, we
want to use our results for the states that we sampled to
induce the probabilities of other states. This task is precisely
a density estimation task (a type of unsupervised learning),
where the set of sampled states is the training data.^

As in any learning task, we must first define the
hypothesis space. Essentially, our representations above fall into
the category of nonparametric density estimators [Sco92].
(Roughly speaking, they are a discrete form of Parzen
window.) As applied in our setting, these density estimators
have no bias (and high variance); thus, they are incapable
of generalizing from the training data to the rest of the
space. In this section, we explore alternative
representations of discrete densities that have higher bias and a
correspondingly higher generalization power. As we discussed
in the introduction, not every representation is suitable for
our needs. Our representation must be significantly more
compact than the full joint over the state variables; it must
support an effective sampling process; and it must be
easily learned. Two appropriate representations are Bayesian
networks and density trees.

Bayesian networks. Given our overall problem, a
Bayesian network representation for our distribution seems
particularly appropriate. After all, our process is
represented as a DBN, and is therefore highly structured. While
it is known that conditional independences are not
maintained in the belief states [BK98], it is reasonable to
assume that some of the random variables in a time slice are
only weakly correlated with each other, and perhaps even
more weakly when conditioned on a third variable.

^If we view SOF as doing a process akin to bootstrapping by
sampling from its own samples, our extension is akin to smoothed
bootstrapping [Sil86].

There has been a substantial amount of recent work on
learning Bayesian networks from data (see [Hec95] for a
survey). The simplest option is to fix the structure of the
Bayesian network and to use our data to fill in the
parameters for it. This process can be accomplished very
efficiently, by a simple traversal over our data.
Specifically, if our Bayesian network contains a node X with
parents Y, then we need to estimate each of the
parameters $P(X = x \mid Y = y)$. The maximum likelihood
estimate for these parameters would be $\sigma_{sc}(x, y)/\sigma_{sc}(y)$.
However, maximum likelihood estimates result in precisely the
type of numerical overfitting (and particularly zero
probability estimates) that we strove to avoid in the previous
section. It turns out that, if we instead estimate the
parameter as $\sigma_{sc+}(x, y)/\sigma_{sc+}(y)$, we get the effect of introducing
a Dirichlet prior over each of our BN parameters. For a
given Bayesian network structure B, the resulting
distribution $\sigma_{bn(B)}$ is the one that minimizes $D(\sigma_{sc+} \| \sigma_{bn(B)})$
among all distributions representable by B.

One potential problem with this approach is that the
BN structure B must be determined in advance, based on
some prior knowledge of the user or on a manual
analysis of the DBN structure. Furthermore, the BN structure is
fixed over the entire length of the run, whereas the true
belief state $\phi^{(t)}$ can change drastically as the process evolves.
It seems quite likely that the most appropriate BN
structure for approximating $\phi^{(t)}$ also varies. This observation
suggests that we select a different BN structure for each
time slice. Unfortunately, learning of BN structure is a
hard problem. Theoretically, even the problem of
learning the optimal structure where each node is restricted to
have at most k parents is NP-hard for any k > 1 [CHG95].
Pragmatically, the algorithms for this learning task are
expensive, performing a greedy search, with multiple restarts,
over the combinatorial (and superexponential) space of BN
structures.

One option is to restrict our search to tree-structured
BNs, that is, ones where each node has at most one parent. Chow
and Liu [CL68] present a simple (quadratic time)
algorithm for finding the tree-structured BN whose
distribution is closest, in terms of relative entropy, to the one
in our data. The intuition is that, in a tree-structured BN,
the edges should correspond to the strongest correlations.
Thus, the algorithm introduces a direct connection between
the variables whose mutual information [CT91] is largest.
Formally, for each pair of variables $X_i, X_j$, we define an
edge-weight

$$W(X_i, X_j) = \sum_{x_i, x_j} \sigma_{sc+}(x_i, x_j) \log \frac{\sigma_{sc+}(x_i, x_j)}{\sigma_{sc+}(x_i)\,\sigma_{sc+}(x_j)}$$

which is precisely the mutual information between $X_i$ and
$X_j$ in $\sigma_{sc+}$. We then choose a maximum-weight spanning
tree over these nodes, where the weight of the tree is the
sum of the weights of the edges it spans. We then select an
arbitrary root for the tree, and fill in the conditional
probability tables for the nodes in the network using $\sigma_{sc+}$ (as in
the case of a fixed BN). Here, again, choosing the
parameters using $\sigma_{sc+}$ is equivalent to introducing a Dirichlet prior
over the network parameters. Chow and Liu show that the
distribution $\sigma_{st}$ represented by the resulting spanning tree
minimizes $D(\sigma_{sc+} \| \sigma_{st})$.
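
The structure-selection step can be sketched as below: pairwise mutual information as edge weights, followed by a maximum-weight spanning tree. For simplicity this sketch assumes the joint distribution is available as an explicit dictionary from full assignments (tuples) to probabilities, which is only practical for small domains; Kruskal's algorithm with union-find is one standard choice for the spanning tree and is an assumption here, not necessarily what the authors used.

```python
import math
from collections import defaultdict
from itertools import combinations


def marginal(joint, idxs):
    """Marginal of the joint over the variables at positions idxs."""
    out = defaultdict(float)
    for assignment, p in joint.items():
        out[tuple(assignment[i] for i in idxs)] += p
    return out


def mutual_information(joint, i, j):
    """Edge weight W(X_i, X_j): mutual information of variables i and j."""
    pij = marginal(joint, (i, j))
    pi = marginal(joint, (i,))
    pj = marginal(joint, (j,))
    mi = 0.0
    for (xi, xj), p in pij.items():
        if p > 0.0:
            mi += p * math.log(p / (pi[(xi,)] * pj[(xj,)]))
    return mi


def chow_liu_edges(joint, n_vars):
    """Undirected edges of a maximum-weight spanning tree (Kruskal)."""
    scored = sorted(
        ((mutual_information(joint, i, j), i, j) for i, j in combinations(range(n_vars), 2)),
        reverse=True,
    )
    parent = list(range(n_vars))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    edges = []
    for _, i, j in scored:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding this edge does not create a cycle
            parent[ri] = rj
            edges.append((i, j))
    return edges


if __name__ == "__main__":
    # Toy joint: variable 1 copies variable 0, variable 2 is independent of both.
    joint = {(a, a, c): 0.25 for a in (0, 1) for c in (0, 1)}
    print(chow_liu_edges(joint, 3))
```

Once the edges are chosen, picking any root and orienting edges away from it gives the tree-structured BN whose conditional probability tables are then filled in from the same distribution.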

Density trees. A discrete density tree is similar in overall
structure to a classification decision tree. However, rather
than representing a conditional distribution over some
distinguished class variable given the features, the density tree
represents a probability distribution $\sigma_{dt}$ over some set of
variables $\mathbf{X}$. Each interior node in the tree is labelled with
a variable X, and the branches at that node with values x
for that variable. A path on the tree to a node n thus
corresponds to an assignment $y_n$ to some subset $\mathbf{Y}_n$ of the variables
in the domain.

The tree is a recursive representation of a multivariate
distribution. At a high level, the tree structure partitions
the space into bins, corresponding to the leaves in the tree.
The distribution at each leaf n is uniform over the variables
$\mathbf{X} - \mathbf{Y}_n$; the different leaf distributions are combined in
a weighted average, where the weight of a leaf is simply
the product of the edge-weights along the path to it. More
technically, letting $\mathbf{Z}_n$ represent $\mathbf{X} - \mathbf{Y}_n$, we have that if
n is a leaf, then $\sigma_{dt}(\mathbf{Z}_n \mid n)$ is uniform over the values of
$\mathbf{Z}_n$; if n is an interior node labelled with X, then
$\sigma_{dt}(\mathbf{Z}_n \mid n) = \sum_x \sigma_{dt}(X = x \mid n) \cdot \sigma_{dt}(\mathbf{Z}_n - \{X\} \mid n_x)$, where
$n_x$ is the child of n corresponding to the value x of X and
$\sigma_{dt}(X = x \mid n)$ is the weight along the edge to it.
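
The recursive definition can be made concrete with a small evaluator. The `DensityNode` class, the `domain_sizes` dictionary, and the hand-built example tree are all hypothetical; the sketch assumes every interior node has a branch for each value of its variable.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class DensityNode:
    var: Optional[str] = None  # None marks a leaf
    # value of var -> (edge weight, child node)
    children: Dict[object, Tuple[float, "DensityNode"]] = field(default_factory=dict)


def tree_prob(node, assignment, domain_sizes, used=frozenset()):
    """Probability of a full assignment: product of edge weights on the path,
    times a uniform distribution over the variables not fixed by the path."""
    if node.var is None:
        n_cells = 1
        for v, size in domain_sizes.items():
            if v not in used:
                n_cells *= size
        return 1.0 / n_cells
    weight, child = node.children[assignment[node.var]]
    return weight * tree_prob(child, assignment, domain_sizes, used | {node.var})


if __name__ == "__main__":
    sizes = {"A": 2, "B": 2, "C": 3}
    leaf = DensityNode()
    tree = DensityNode("A", {
        0: (0.75, leaf),  # A = 0: uniform over B and C
        1: (0.25, DensityNode("B", {0: (0.5, leaf), 1: (0.5, leaf)})),
    })
    print(tree_prob(tree, {"A": 0, "B": 1, "C": 2}, sizes))
    print(tree_prob(tree, {"A": 1, "B": 0, "C": 2}, sizes))
```

Sampling from such a tree is equally direct: follow edges at random according to their weights, then fill in the remaining variables uniformly.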

Our error function for the density tree learning
algorithm is the relative entropy between our empirical
distribution $\sigma_{sc+}$ and $\sigma_{dt}$. We use a greedy algorithm which is
very similar to the one for classification trees. We start out
with the tree containing only the root node. We then
iteratively split nodes on the variable that most decreases this
error function. At each point, we estimate the parameters
using $\sigma_{sc+}$, as we did for BNs. We use a greedy algorithm
to determine the splits. The contribution that n makes to
the overall relative entropy, if it remains a leaf, is
proportional to $D(\sigma_{sc+}(\mathbf{Z}_n \mid y_n) \| u_{\mathbf{Z}_n})$, where $u_{\mathbf{Z}_n}$ is the
uniform distribution over the assignments z to $\mathbf{Z}_n$. If we split
n on a variable X, each of its children (assuming they
remain leaves) would make a contribution proportional to
$\sigma_{sc+}(x \mid y_n)\, D(\sigma_{sc+}(\mathbf{Z}_n - \{X\} \mid y_n, x) \| u_{\mathbf{Z}_n - \{X\}})$. It
is easy to show that the decrease in the relative entropy is
precisely $D(\sigma_{sc+}(X \mid y_n) \| u_X)$. We split n on that
variable X which maximizes this decrease. Intuitively, this
rule makes perfect sense: if we are representing the
distributions at the leaves as uniform, then we should first
extract those variables whose marginal distribution at n is the
farthest from being uniform.

                     Relative error            #samples/slice   runtime/slice (minutes)
counting             2.275 ± 1.07 × 10^…       1024 ± 1066      0.722
Chow-Liu tree        2.106 ± 0.75 × 10^…       962 ± 977        0.044
BN 1 (29 params)     2.102 ± 0.96 × 10^…       981 ± 970        0.064
BN 2 (340 params)    2.104 ± 0.79 × 10^…       962 ± 1045       0.07
BN 3 (1401 params)   2.112 ± 0.73 × 10^…       990 ± 1045       0.07
density tree         1.816 ± 0.89 × 10^…       985 ± 1063       0.068

Figure 4: Means and standard deviations for different belief state
representations

In order to avoid overfitting, we prevent the density tree
from growing to fit all of the samples. We utilize the
standard idea of early stopping: our stopping rule prevents a
node from splitting when the improvement to the relative
entropy score is lower than some minimal amount.
Specifically, we only allow a split of n on X when
$\sigma_{sc+}(y_n) \cdot D(\sigma_{sc+}(X \mid y_n) \| u_X)$ is higher than some threshold.
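
The greedy split rule and the stopping rule can be combined into a short learner. This is a sketch under simplifying assumptions: the distribution being fitted is passed as an explicit dictionary from full assignments (tuples) to probabilities, the tree is returned as nested dictionaries, and the function names and the default `threshold` value are hypothetical.

```python
import math
from collections import defaultdict


def kl_from_uniform(dist, domain_size):
    """D(dist || uniform) over a domain of the given size."""
    return sum(p * math.log(p * domain_size) for p in dist.values() if p > 0.0)


def grow(joint, domains, path=(), threshold=0.01):
    """Greedy density-tree learner.
    joint: dict assignment tuple -> probability; domains: list of value lists;
    path: tuple of (variable index, value) pairs fixed on the way to this node."""
    fixed = dict(path)
    mass = {a: p for a, p in joint.items() if all(a[i] == v for i, v in fixed.items())}
    p_path = sum(mass.values())  # probability of reaching this node
    best_var, best_gain, best_marg = None, 0.0, None
    if p_path > 0.0:
        for i in range(len(domains)):
            if i in fixed:
                continue
            marg = defaultdict(float)  # conditional marginal of variable i at this node
            for a, p in mass.items():
                marg[a[i]] += p / p_path
            gain = p_path * kl_from_uniform(marg, len(domains[i]))
            if gain > best_gain:
                best_var, best_gain, best_marg = i, gain, marg
    if best_var is None or best_gain <= threshold:
        return {"leaf": True}  # early stopping: improvement too small
    return {
        "leaf": False,
        "var": best_var,
        "children": {
            v: (best_marg.get(v, 0.0),
                grow(joint, domains, path + ((best_var, v),), threshold))
            for v in domains[best_var]
        },
    }


if __name__ == "__main__":
    # Variable 0 is skewed, variable 1 is uniform and independent of it.
    joint = {(0, 0): 0.45, (0, 1): 0.45, (1, 0): 0.05, (1, 1): 0.05}
    print(grow(joint, [[0, 1], [0, 1]]))
```

In the toy run only the skewed variable is split on; the uniform variable offers no improvement and is left to the uniform leaf distributions.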

We  note  that  our  notion  of  a  density  tree  draws  upon 
the  literature  of  semiparametric  density  estimation  tech¬ 
niques  for  continuous  densities  [Sco92].  The  uniform  dis¬ 
tribution  over  samples  at  each  leaf  is  similar  to  multi¬ 
dimensional  histogram  techniques;  however,  the  tree  struc¬ 
ture  allows  variable-sized  bins,  and  therefore  greater  flex¬ 
ibility  in  matching  the  number  of  parameters  to  the  com¬ 
plexity  of  the  distribution. 

5  Experimental  results 

To  provide  a  more  realistic  comparison,  we  tested  the  dif¬ 
ferent  variants  of  our  algorithm  on  the  practical  WATER 
DBN  [JKOP89],  used  for  monitoring  the  biological  pro¬ 
cesses  of  a  water  purification  plant.  (Comparable  results 
were  obtained  for  the  CAPITAL  network.)  The  WATER  DBN 
had  a  substantially  larger  state  space,  with  27,648  possible 
values  taken  by  the  (non-evidence)  variables.  The  structure 
of  the  WATER  network  is  shown  in  Figure  1(c). 

We  experimented  with  several  belief  state  representa¬ 
tions:  simple  counting  (SOF  extended  with  priors);  three 
different  Bayesian  networks  of  fixed  structure,  with  29, 
340,  and  1401  parameters  respectively;  Chow-Liu  span¬ 
ning  trees;  and  density  trees.  We  tested  each  representa¬ 
tion  on  10  runs,  each  of  length  100,  and  where  we  used  a 
variable-samples  approach  with  a  target  weight  of  5.  For 
each  run,  we  tested  the  average  relative  entropy  error  over 
the run.^ (We also tested $L_1$ error, with comparable
results.) We then computed the mean and standard deviation
of  these  run-average  errors  for  the  different  representations. 
We  did  the  same  for  the  number  of  samples  utilized  per 
time  slice.  The  results  are  shown  in  Figure  4. 

Not surprisingly, the worst performer in terms of
accuracy is the simple counting approach. The performance
of the Chow-Liu trees and the fixed Bayesian networks is
about comparable, although the Bayesian network with a
large number of parameters performs slightly worse than
the rest. The density tree approach performs best, with
a fairly significant margin. The number of samples used
by the different approaches is not significantly different.
What is significant is the fact that the number of samples
generated is a factor of 15-25 smaller than the number
of states in the state space. Indeed, the running times for
the different approaches are all significantly lower than the
1.89 minutes per time slice required by exact inference. We
note that the running times were all estimated on simple
prototype code. We expect the running times for optimized
code to be significantly lower. However, the relative
efficiencies of the different algorithms should remain the same.

^We note that the momentary errors within a run (for belief
states at individual time slices) can also vary widely, as can be
seen from Figure 3. We tested the standard deviation of the
momentary errors within a run, and it was approximately the same
among all representations: around 50-55% of the overall
average for the run. We omit the detailed results.


Figure  5:  Average  error  for  the  water  network  for  different  tar¬ 
get  weights.  The  average  is  over  10  runs  of  100  time  slices  each. 

Figure  5  gives  more  evidence  in  favor  of  the  density  tree 
approach,  demonstrating  that  it  makes  somewhat  better  use 
of  data.  The  graph  is  a  type  of  learning  curve  for  the  dif¬ 
ferent  approaches:  the  average  error  as  a  function  of  the 
target  weight.  We  see  that,  for  any  given  target  weight, 
the  density  tree  achieves  higher  accuracy.  Furthermore,  as 
we  increase  the  target  weight  for  our  sampling  algorithm, 
the error in the density tree approach decreases slightly
faster. We note that this improvement does not come at the
expense of increasing the overall number of samples: our
experiments show that the average number of samples used
is essentially identical for the different algorithms, and
essentially linear in the target weight.

We  believe  that  two  factors  contribute  to  making  den¬ 
sity  trees  a  suitable  representation  for  this  task.  The  first  is 
its  inductive  bias.  A  BN  representation  reflects  an  assump¬ 
tion  that  some  of  the  random  variables  in  the  domain  in¬ 
fluence  each  other  only  weakly  or  indirectly  via  other  vari¬ 
ables.  A  density  tree  representation  reflects  an  assumption 
that the distribution is substantially different when
conditioned on different values of the same variable. Our
results indicate that the variability across different values of
a variable is a more significant factor than any (weak)
independences found in the distribution. We believe that the
evidence serves to sharply skew the distribution in a certain
direction, making it much more important for the
approximate probability distribution to appropriately model that
part of the space. Indeed, an examination of the trees
produced by the density tree algorithm for different time slices
shows that the parts of the tree corresponding to more likely
parts of the space are usually represented using a much
finer granularity (with subtrees that are two or more levels
deeper) than the less likely ones.

A secondary factor that we believe also contributes to
these performance results is the more flexible choice of the
structure of the representation. This flexibility, which is
shared by Chow-Liu trees and density trees, allows the
representation of the approximate belief state to adapt to the
current state of the process. An examination of the actual
models learned by these algorithms at different points in
time shows that the structure does, in fact, vary
significantly. This property is particularly helpful in the density
tree case, as the most likely part of the state space changes
in virtually every time slice.

6  Extensions  and  Conclusions 

This paper deals with sampling-based approximate
monitoring algorithms for a stochastic dynamic process. We
have proposed the use of machine learning techniques in
der  to  allow  the  algorithm  to  generalize  from  samples  it  has 
generated  to  samples  it  has  not.  We  have  shown  that  this 
idea  can  significantly  improve  the  quality  of  our  tracking 
for  a  given  allocation  of  computational  resources.  We  note 
that  a  related  idea  [BD97]  has  been  proposed  in  the  domain 
of  combinatorial  optimization  algorithms,  and  has  proved 
very  effective.  There,  rather  than  maintaining  a  popula¬ 
tion  of  candidate  solutions  (as  in  genetic  algorithms),  the 
“good” candidate solutions generated by the algorithm are
used  to  learn  a  distribution,  from  which  samples  are  then 
generated  for  the  next  optimization  phase. 

We have investigated the use of several representations
for our probability distributions. We saw that we get
significant benefits from allowing the structure of the distribution
to vary according to context, both for different parts of the
space within the same distribution, and for different
distributions over time. In our density tree representation, this
flexibility was part of the definition. It would be
interesting to see whether we could get even better performance
by allowing the other representations to be more flexible.
One possibility is to combine Bayesian networks and
density trees; there are several ways of doing so, which we are
currently investigating. We are also considering the use of
other (computationally more expensive) representations of
a density, e.g., as a mixture model where the mixture
components have independent features [CS95].

It is interesting to also compare our approach to other
types  of  algorithms  for  inference  in  stochastic  processes. 
As  we  have  shown,  the  number  of  samples  generated 
by  our  algorithm  is  significantly  lower  than  the  num¬ 
ber  of  states  in  the  explicit  representation  of  the  belief 
state.  Thus,  our  algorithm  allows  us  to  deal  with  do¬ 
mains  in  which  exact  inference  is  intractable.  Another  op¬ 
tion  is  to  use  non-stochastic  approximate  inference  algo¬ 
rithms  [GJ96,  BK98].  The  approach  of  [GJ96]  is  not  re¬ 
ally  intended  for  real-time  monitoring,  and  is  probably  too 
computationally  expensive  to  be  used  in  that  role.  It  also 
applies  only  to  a  fairly  narrow  class  of  stochastic  models. 
The  algorithm  of  [BK98]  is  more  comparable  to  ours;  es¬ 
sentially,  it  avoids  the  sampling  step,  directly  propagating  a 
time  t  approximate  belief  state  to  a  time  t  +  1  approximate 
belief  state.  For  certain  types  of  processes,  this  approach 
probably  dominates  ours,  as  it  avoids  the  additional  vari¬ 
ance  introduced  by  the  sampling  phase.  However,  it  is  not 
obvious  how  it  can  be  implemented  effectively  for  all  be¬ 
lief  state  representations  (e.g.,  for  density  trees).  Further¬ 
more,  it  does  not  apply  to  processes  where  the  represen¬ 
tation  of  the  process  itself  does  not  admit  exact  inference 
(e.g.,  highly-connected  DBN  models  or  models  involving 
continuous  variables). 

By  contrast,  we  note  that  our  ideas  are  not  specific  to 
DBNs.  The  only  use  we  made  of  the  DBN  model  is  as  a 
representation  from  which  we  can  generate  random  sam¬ 
ples.  We  believe  that  our  ideas  apply  to  a  much  wider 
range  of  processes.  Indeed,  Isard  and  Blake  [IB96]  have 
obtained  impressive  results  by  using  a  stochastic  sampling 
algorithm  identical  to  simple  SOF  for  the  task  of  monitor¬ 
ing  object  motion  in  cluttered  scenes.  Here,  the  process  is 
described using fairly complex continuous dynamics that
do not permit any exact inference algorithm. We believe
that our ideas can also be used to provide improved
algorithms for complex processes such as these, as well as for
processes involving both continuous and discrete variables.

Abstract

Similarity  is  an  important  and  widely  used  con¬ 
cept.  Previous  definitions  of  similarity  are  tied 
to  a  particular  application  or  a  form  of  knowl¬ 
edge  representation.  We  present  an  information- 
theoretic  definition  of  similarity  that  is  applica¬ 
ble  as  long  as  there  is  a  probabilistic  model.  We 
demonstrate  how  our  definition  can  be  used  to 
measure  the  similarity  in  a  number  of  different 
domains. 


1  Introduction 

Similarity  is  a  fundamental  and  widely  used  concept. 
Many  similarity  measures  have  been  proposed,  such  as 
information  content  [Resnik,  1995b],  mutual  information 
[Hindle, 1990], Dice coefficient [Frakes and Baeza-Yates,
1992], cosine coefficient [Frakes and Baeza-Yates, 1992],
distance-based measurements [Lee et al., 1989; Rada et al.,
1989], and feature contrast model [Tversky, 1977]. McGill
et al. surveyed and compared 67 similarity measures used in
information retrieval [McGill et al., 1979].

A  problem  with  previous  similarity  measures  is  that  each 
of  them  is  tied  to  a  particular  application  or  assumes  a 
particular  domain  model.  For  example,  distance-based 
measures  of  concept  similarity  (e.g.,  [Lee  et  al.,  1989; 
Rada  et  al.,  1989])  assume  that  the  domain  is  represented  in 
a  network.  If  a  collection  of  documents  is  not  represented 
as  a  network,  the  distance-based  measures  do  not  apply. 
The  Dice  and  cosine  coefficients  are  applicable  only  when 
the  objects  are  represented  as  numerical  feature  vectors. 

Another  problem  with  the  previous  similarity  measures  is 
that  their  underlying  assumptions  are  often  not  explicitly 
stated.  Without  knowing  those  assumptions,  it  is  impossi¬ 
ble to make theoretical arguments for or against any
particular measure. Almost all of the comparisons and
evaluations of previous similarity measures have been based on
ations  of  previous  similarity  measures  have  been  based  on 
empirical  results. 

This  paper  presents  a  definition  of  similarity  that  achieves 
two  goals: 

Universality:  We  define  similarity  in  information- 
theoretic  terms.  It  is  applicable  as  long  as  the  domain 
has  a  probabilistic  model.  Since  probability  theory 
can  be  integrated  with  many  kinds  of  knowledge 
representations,  such  as  first  order  logic  [Bacchus, 
1988] and semantic networks [Pearl, 1988], our definition
of similarity can be applied to many different
domains  where  very  different  similarity  measures  had 
previously  been  proposed.  Moreover,  the  universality 
of  the  definition  also  allows  the  measure  to  be  used  in 
domains  where  no  similarity  measure  has  previously 
been  proposed,  such  as  the  similarity  between  ordinal 
values. 

Theoretical justification: The similarity measure is not
defined  directly  by  a  formula.  Rather,  it  is  derived 
from  a  set  of  assumptions  about  similarity.  In  other 
words,  if  the  assumptions  are  deemed  reasonable,  the 
similarity  measure  necessarily  follows. 

The  remainder  of  this  paper  is  organized  as  follows.  The 
next section presents the derivation of a similarity measure
from a set of assumptions about similarity. Sections 3
through  6  demonstrate  the  universality  of  our  proposal  by 
applying  it  to  different  domains.  The  properties  of  different 
similarity  measures  are  compared  in  Section  7. 

2  Definition  of  Similarity 

Since our goal is to provide a formal definition of the
intuitive concept of similarity, we first clarify our intuitions
about similarity.

Intuition 1: The similarity between A and B is related
to  their  commonality.  The  more  commonality  they 
share,  the  more  similar  they  are. 

Intuition  2:  The  similarity  between  A  and  B  is  related  to 
the  differences  between  them.  The  more  differences 
they  have,  the  less  similar  they  are. 

Intuition  3:  The  maximum  similarity  between  A  and  B  is 
reached  when  A  and  B  are  identical,  no  matter  how 
much  commonality  they  share. 

Our goal is to arrive at a definition of similarity that
captures the above intuitions. However, there are many
alternative ways to define similarity that would be consistent
with  the  intuitions.  In  this  section,  we  first  make  a  set  of 
additional  assumptions  about  similarity  that  we  believe  to 
be  reasonable.  A  similarity  measure  can  then  be  derived 
from  those  assumptions. 

In order to capture the intuition that the similarity of two
objects is related to their commonality, we need a measure
of commonality. Our first assumption is:

Assumption 1: The commonality between A and B is measured by

$$I(\mathrm{common}(A, B))$$

where common(A, B) is a proposition that states the
commonalities between A and B; I(s) is the amount of
information contained in a proposition s.

For example, if A is an orange and B is an apple, the
proposition that states the commonality between A and B
is “fruit(A) and fruit(B)”. In information theory [Cover and
Thomas, 1991], the information contained in a statement
is measured by the negative logarithm of the probability of
the statement. Therefore,

$$I(\mathrm{common}(A, B)) = -\log P(\mathrm{fruit}(A) \text{ and } \mathrm{fruit}(B))$$

We also need a measure of the differences between two objects.
Since knowing both the commonalities and the differences
between A and B means knowing what A and B are, we assume:

Assumption 2: The difference between A and B is measured by

$$I(\mathrm{description}(A, B)) - I(\mathrm{common}(A, B))$$

where description(A, B) is a proposition that describes
what A and B are.

Intuitions 1 and 2 state that the similarity between two objects
is related to their commonalities and differences. We
assume that commonalities and differences are the only factors.

Assumption 3: The similarity between A and B,
sim(A, B), is a function of their commonalities and
differences. That is,

$$\mathrm{sim}(A, B) = f(I(\mathrm{common}(A, B)),\ I(\mathrm{description}(A, B)))$$

The domain of f is $\{(x, y) \mid x \ge 0,\ y > 0,\ y \ge x\}$.

Intuition 3 states that the similarity measure reaches a constant
maximum when the two objects are identical. We assume
the constant is 1.

Assumption 4: The similarity between a pair of identical
objects is 1.

When A and B are identical, knowing their commonalities
means knowing what they are, i.e., I(common(A, B)) =
I(description(A, B)). Therefore, the function f must have
the property: $\forall x > 0,\ f(x, x) = 1$.

When there is no commonality between A and B, we assume
their similarity is 0, no matter how different they are.
For example, the similarity between “depth-first search”
and “leather sofa” is neither higher nor lower than the
similarity between “rectangle” and “interest rate”.

Assumption 5: $\forall y > 0,\ f(0, y) = 0$.

Suppose two objects A and B can be viewed from two
independent perspectives. Their similarity can be computed
separately from each perspective. For example, the similarity
between two documents can be calculated by comparing
the sets of words in the documents or by comparing
their stylistic parameter values, such as average word
length, average sentence length, average number of verbs
per sentence, etc. We assume that the overall similarity of
the two documents is a weighted average of their similarities
computed from different perspectives. The weights are
the amounts of information in the descriptions. In other
words, we make the following assumption:

Assumption 6:

$$\forall x_1 \le y_1,\ x_2 \le y_2:\quad f(x_1 + x_2,\ y_1 + y_2) = \frac{y_1}{y_1 + y_2} f(x_1, y_1) + \frac{y_2}{y_1 + y_2} f(x_2, y_2)$$

From the above assumptions, we can prove the following
theorem:

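Similarity Theorem: The similarity between A and B is
measured by the ratio between the amount of information
needed to state the commonality of A and B and the
information needed to fully describe what A and B are:

$$\mathrm{sim}(A, B) = \frac{\log P(\mathrm{common}(A, B))}{\log P(\mathrm{description}(A, B))}$$

Proof:

$$\begin{aligned}
f(x, y) &= f(x + 0,\ x + (y - x)) \\
&= \frac{x}{y} \times f(x, x) + \frac{y - x}{y} \times f(0,\ y - x) && \text{(Assumption 6)} \\
&= \frac{x}{y} \times 1 + \frac{y - x}{y} \times 0 = \frac{x}{y} && \text{(Assumptions 4 and 5)}
\end{aligned}$$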
Q.E.D. 
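
The theorem can be checked numerically. The following is a minimal sketch in Python, assuming natural logarithms; the names `f` and `similarity` and the sample numbers are illustrative and do not come from the paper.

```python
import math

def f(x: float, y: float) -> float:
    """Similarity as a function of x = I(common) and y = I(description)."""
    if not (x >= 0 and y > 0 and y >= x):
        raise ValueError("need 0 <= x <= y and y > 0")
    return x / y

def similarity(p_common: float, p_description: float) -> float:
    """sim(A, B) = log P(common(A, B)) / log P(description(A, B))."""
    return f(-math.log(p_common), -math.log(p_description))

# Assumption 4: identical objects have similarity 1.
assert f(3.0, 3.0) == 1.0
# Assumption 5: no commonality gives similarity 0.
assert f(0.0, 2.5) == 0.0
# Assumption 6: the overall similarity is the average of the similarities
# from two perspectives, weighted by the information in each description.
x1, y1, x2, y2 = 1.0, 4.0, 2.0, 3.0  # arbitrary illustrative values
lhs = f(x1 + x2, y1 + y2)
rhs = y1 / (y1 + y2) * f(x1, y1) + y2 / (y1 + y2) * f(x2, y2)
assert math.isclose(lhs, rhs)
```

The ratio x/y satisfies all three assumptions on these sample values, which is what the proof establishes in general.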

Since similarity is the ratio between the amount of information
in the commonality and the amount of information
in the description of the two objects, if we know the
commonality of the two objects, their similarity tells us how
much more information is needed to determine what these
two objects are.

In the next 4 sections, we demonstrate how the above
definition can be applied in different domains.

3  Similarity  between  Ordinal  Values 

Many features have ordinal values. For example, the “quality”
attribute can take one of the following values: “excellent”,
“good”, “average”, “bad”, or “awful”. None of the
previous definitions of similarity provides a measure for
the similarity between two ordinal values. We now show
how our definition can be applied here.

If “the quality of X is excellent” and “the quality of Y is
average”, the maximally specific statement that can be said
of both X and Y is that “the quality of X and Y is between
‘average’ and ‘excellent’”. Therefore, the commonality
between two ordinal values is the interval delimited by them.

Suppose  the  distribution  of  the  “quality”  attribute  is  known 
(Figure  1).  The  following  are  four  examples  of  similarity 
calculations: 

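$$\mathrm{sim}(\mathit{excellent}, \mathit{good}) = \frac{2 \times \log P(\mathit{excellent} \vee \mathit{good})}{\log P(\mathit{excellent}) + \log P(\mathit{good})} = \frac{2 \times \log(0.05 + 0.10)}{\log 0.05 + \log 0.10} = 0.72$$

$$\mathrm{sim}(\mathit{good}, \mathit{average}) = \frac{2 \times \log P(\mathit{good} \vee \mathit{average})}{\log P(\mathit{good}) + \log P(\mathit{average})} = \frac{2 \times \log(0.10 + 0.50)}{\log 0.10 + \log 0.50} = 0.34$$

$$\mathrm{sim}(\mathit{excellent}, \mathit{average}) = \frac{2 \times \log P(\mathit{excellent} \vee \mathit{good} \vee \mathit{average})}{\log P(\mathit{excellent}) + \log P(\mathit{average})} = \frac{2 \times \log(0.05 + 0.10 + 0.50)}{\log 0.05 + \log 0.50} = 0.23$$

$$\mathrm{sim}(\mathit{good}, \mathit{bad}) = \frac{2 \times \log P(\mathit{good} \vee \mathit{average} \vee \mathit{bad})}{\log P(\mathit{good}) + \log P(\mathit{bad})} = \frac{2 \times \log(0.10 + 0.50 + 0.20)}{\log 0.10 + \log 0.20} = 0.11$$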

The results show that, given the probability distribution in
Figure 1, the similarity between “excellent” and “good” is
much higher than the similarity between “good” and
“average”; the similarity between “excellent” and “average” is
much higher than the similarity between “good” and “bad”.
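
The four calculations can be reproduced with a short script. This is a sketch rather than the authors' program: the probabilities for "excellent", "good", "average" and "bad" are the ones used in the worked examples, while the value for "awful" does not appear in the text and is assumed here only so that the distribution sums to 1.

```python
import math

ORDER = ["excellent", "good", "average", "bad", "awful"]
# "awful": 0.15 is an assumption (the remainder); the other values are from the text.
P = {"excellent": 0.05, "good": 0.10, "average": 0.50, "bad": 0.20, "awful": 0.15}

def ordinal_similarity(a: str, b: str) -> float:
    """2 * log P(interval delimited by a and b) / (log P(a) + log P(b))."""
    i, j = sorted((ORDER.index(a), ORDER.index(b)))
    p_interval = sum(P[v] for v in ORDER[i:j + 1])
    return 2 * math.log(p_interval) / (math.log(P[a]) + math.log(P[b]))

for pair in [("excellent", "good"), ("good", "average"),
             ("excellent", "average"), ("good", "bad")]:
    print(pair, round(ordinal_similarity(*pair), 2))
```

Because the commonality of two ordinal values is the whole interval between them, a pair that spans a high-probability value such as "average" carries little shared information and scores low.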

Figure 1: Example Distribution of Ordinal Values

4  Feature  Vectors

Feature vectors are one of the simplest and most commonly
used forms of knowledge representation, especially in
case-based reasoning [Aha et al., 1991; Stanfill and Waltz, 1986]
and machine learning. Weights are often assigned to features
to account for the fact that the dissimilarity caused
by more important features is greater than the dissimilarity
caused by less important features. The assignment of the
weight parameters is generally heuristic in nature in previous
approaches. Our definition of similarity provides a
more principled approach, as demonstrated in the following
case study.

4.1  String  Similarity — A  case  study 

Consider the task of retrieving from a word list the words
that are derived from the same root as a given word. For
example, given the word “eloquently”, our objective is to
retrieve the other related words such as “ineloquent”,
“ineloquently”, “eloquent”, and “eloquence”. To do so,
assuming that a morphological analyzer is not available, one
can define a similarity measure between two strings and
rank the words in the word list in descending order of their
similarity to the given word. The similarity measure should
be such that words derived from the same root as the given
word appear early in the ranking.

We  experimented  with  three  similarity  measures.  The  first 
one  is  defined  as  follows: 

$$\mathrm{sim}_{\mathrm{edit}}(x, y) = \frac{1}{1 + \mathrm{editDist}(x, y)}$$

where  editDist(x,  y)  is  the  minimum  number  of  character 
insertion  and  deletion  operations  needed  to  transform  one 
string  to  the  other. 

The  second  similarity  measure  is  based  on  the  number  of 
different  trigrams  in  the  two  strings: 

$$\mathrm{sim}_{\mathrm{tri}}(x, y) = \frac{1}{1 + |\mathrm{tri}(x)| + |\mathrm{tri}(y)| - 2 \times |\mathrm{tri}(x) \cap \mathrm{tri}(y)|}$$

where tri(x) is the set of trigrams in x. For example,
tri(eloquent) = {elo, loq, oqu, que, uen, ent}.
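
The first two measures can be sketched in a few lines of Python. The function names below are illustrative. The insertion/deletion-only edit distance is computed here from the longest common subsequence (LCS), which is a standard identity and an implementation choice of this sketch, not something the paper specifies.

```python
from fractions import Fraction

def trigrams(word: str) -> set:
    """Set of character trigrams of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def edit_dist(x: str, y: str) -> int:
    """Insert/delete-only edit distance: len(x) + len(y) - 2 * LCS(x, y)."""
    prev = [0] * (len(y) + 1)          # LCS lengths for the previous row
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return len(x) + len(y) - 2 * prev[-1]

def sim_edit(x: str, y: str) -> Fraction:
    return Fraction(1, 1 + edit_dist(x, y))

def sim_tri(x: str, y: str) -> Fraction:
    tx, ty = trigrams(x), trigrams(y)
    return Fraction(1, 1 + len(tx) + len(ty) - 2 * len(tx & ty))

print(sim_edit("grandiloquent", "grandiloquence"))  # 1/4
print(sim_tri("grandiloquent", "eloquent"))         # 1/8
```

Exact fractions are used so that the scores can be compared directly with values of the form 1/n.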

Table 1: Top-10 Most Similar Words to “grandiloquent”

Rank  sim_edit                 sim_tri                  sim
 1    grandiloquently  1/3     grandiloquently  1/2     grandiloquently  0.92
 2    grandiloquence   1/4     grandiloquence   1/4     grandiloquence   0.89
 3    magniloquent     1/6     eloquent         1/8     eloquent         0.61
 4    gradient         1/6     grand            1/9     magniloquent     0.59
 5    grandaunt        1/7     grande           1/10    ineloquent       0.55
 6    gradients        1/7     rand             1/10    eloquently       0.55
 7    grandiose        1/7     magniloquent     1/10    ineloquently     0.50
 8    diluent          1/7     ineloquent       1/10    magniloquence    0.50
 9    ineloquent       1/8     grands           1/10    eloquence        0.50
10    grandson         1/8     eloquently       1/10    ventriloquy      0.42

Table 2: Evaluation of String Similarity Measures

                                            11-point average precisions
Root    Meaning                  |W_root|   sim_edit   sim_tri   sim
agog    leader, leading, bring   23         37%        40%       70%
cardi   heart                    56         18%        21%       47%
circum  around, surrounding      58         24%        19%       68%
gress   to step, to walk, to go  84         22%        31%       52%
loqu    to speak                 39         19%        20%       57%

The third similarity measure is based on our proposed
definition of similarity under the assumption that the
probability of a trigram occurring in a word is independent of other
trigrams in the word:

$$\mathrm{sim}(x, y) = \frac{2 \times \sum_{t \in \mathrm{tri}(x) \cap \mathrm{tri}(y)} \log P(t)}{\sum_{t \in \mathrm{tri}(x)} \log P(t) + \sum_{t \in \mathrm{tri}(y)} \log P(t)}$$

Table 1 shows the top 10 most similar words to “grandiloquent”
according to the above three similarity measures.
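
The third measure can be sketched as follows. Two things here are assumptions of the sketch: P(t) is estimated as the fraction of words in the list that contain trigram t (one reading of "estimated by their frequencies in the words"), and the seven-word list `WORDS` is a hypothetical stand-in for the 109,582-word list, so the scores will not match Table 1.

```python
import math
from collections import Counter

def trigrams(word):
    """Set of character trigrams of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

def trigram_probabilities(words):
    """Assumed estimator: fraction of words that contain each trigram."""
    counts = Counter(t for w in words for t in trigrams(w))
    return {t: c / len(words) for t, c in counts.items()}

def info(tris, prob):
    """Information in a set of trigrams under independence: sum of -log P(t)."""
    return math.fsum(-math.log(prob[t]) for t in tris)

def sim(x, y, prob):
    tx, ty = trigrams(x), trigrams(y)
    total = info(tx, prob) + info(ty, prob)
    return 2 * info(tx & ty, prob) / total if total else 0.0

# Hypothetical toy word list; every word compared below must be in it.
WORDS = ["eloquent", "eloquently", "eloquence", "grand",
         "grandiloquent", "gradient", "student"]
PROB = trigram_probabilities(WORDS)

for w in ["eloquently", "gradient", "grand"]:
    print(w, round(sim("eloquent", w, PROB), 2))
```

Rare trigrams contribute more information than common ones, so sharing an unusual trigram raises the score more than sharing a frequent ending such as "ent".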

To determine which similarity measure ranks higher the
words that are derived from the same root as the given
word, we adopted the evaluation metrics used in the Text
Retrieval Conference [Harman, 1993]. We used a
109,582-word list from the AI Repository.¹ The probabilities of
trigrams are estimated by their frequencies in the words.
Let $W$ denote the set of words in the word list and $W_{root}$
denote the subset of $W$ that are derived from root. Let
$(w_1, \ldots, w_n)$ denote the ordering of $W - \{w\}$ in
descending similarity to $w$ according to a similarity measure.
The precision of $(w_1, \ldots, w_n)$ at recall level N% is
defined as the maximum value of
$\frac{|W_{root} \cap \{w_1, \ldots, w_k\}|}{k}$ such that
$k \in \{1, \ldots, n\}$ and
$\frac{|W_{root} \cap \{w_1, \ldots, w_k\}|}{|W_{root}|} \ge N\%$.
The quality of the sequence $(w_1, \ldots, w_n)$ can be measured
by the 11-point average of its precisions at recall levels 0%, 10%,
20%, ..., and 100%. The average precision values are then
averaged over all the words in $W_{root}$. The results on 5
roots are shown in Table 2. It can be seen that much better
results were achieved with sim than with the other similarity
measures. The reason for this is that $\mathrm{sim}_{\mathrm{edit}}$
and $\mathrm{sim}_{\mathrm{tri}}$ treat all characters or trigrams
equally, whereas sim is able to automatically take into account
the varied importance of different trigrams.

¹ http://www.cs.cmu.edu/afs/cs/project/ai-repository
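
The 11-point average precision described above can be sketched as below. The function names and the toy ranking are hypothetical; recall levels are passed as integer percentages so that the threshold comparison avoids floating-point rounding.

```python
def precision_at_recall(ranking, relevant, recall_pct):
    """Largest precision over cut-offs k whose recall is at least recall_pct %."""
    best, hits = 0.0, 0
    for k, word in enumerate(ranking, start=1):
        if word in relevant:
            hits += 1
        # recall = hits / len(relevant) >= recall_pct / 100, in integer arithmetic
        if hits * 100 >= recall_pct * len(relevant):
            best = max(best, hits / k)
    return best

def eleven_point_average(ranking, relevant):
    """Mean of the precisions at recall levels 0%, 10%, ..., 100%."""
    levels = range(0, 101, 10)
    return sum(precision_at_recall(ranking, relevant, r) for r in levels) / 11

# Toy example: four ranked words, two of which share the root.
ranking = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(eleven_point_average(ranking, relevant))
```

In this toy ranking the precision is 1 for recall levels up to 50% and 2/3 above that, so a ranking that places all related words first is the only way to reach an average of 1.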

5  Word  Similarity 

In this section, we show how to measure similarities
between words according to their distribution in a text corpus
[Pereira et al., 1993]. Similar to [Alshawi and Carter, 1994;
Grishman and Sterling, 1994; Ruge, 1992], we use a parser
to extract dependency triples from the text corpus. A
dependency triple consists of a head, a dependency type and
a modifier. For example, the dependency triples in “I have
a brown dog” consist of:

(1) (have subj I), (have obj dog), (dog adj-mod brown),
(dog det a)

where “subj” is the relationship between a verb and its
subject; “obj” is the relationship between a verb and its object;
“adj-mod” is the relationship between a noun and its adjective
modifier and “det” is the relationship between a noun
and its determiner.

We can view dependency triples extracted from a corpus
as features of the heads and modifiers in the triples. If
(avert obj duty) is a dependency triple, we say that
“duty” has the feature obj-of(avert) and “avert” has the
feature obj(duty). Other words that also possess the feature
obj-of(avert) include “default”, “crisis”, “eye”, “panic”,
“strike”, “war”, etc., which are also used as objects of
“avert” in the corpus.
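
The mapping from triples to features can be sketched as follows. The function name is illustrative, and applying the "-of" suffix to the modifier side of every relation type generalizes the obj / obj-of example in the text, which is an assumption of this sketch.

```python
from collections import defaultdict

def features_from_triples(triples):
    """Map each word to its feature set, given (head, relation, modifier) triples."""
    features = defaultdict(set)
    for head, rel, mod in triples:
        features[head].add(f"{rel}({mod})")      # e.g. "avert" gets obj(duty)
        features[mod].add(f"{rel}-of({head})")   # e.g. "duty" gets obj-of(avert)
    return features

triples = [
    ("have", "subj", "I"), ("have", "obj", "dog"),
    ("dog", "adj-mod", "brown"), ("dog", "det", "a"),
    ("avert", "obj", "duty"),
]
features = features_from_triples(triples)
print(sorted(features["dog"]))
print(sorted(features["duty"]))
```

Each word accumulates features both as a head and as a modifier, so "dog" ends up described by the verb it is the object of as well as by its own modifiers.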

Table 3 shows a subset of the features of “duty” and
“sanction”. Each row corresponds to a feature. An ‘x’ in the
“duty” or “sanction” column means that the word possesses
that feature.


Table 3: Features of “duty” and “sanction”

Feature                    duty   sanction   I(f)
f1: subj-of(include)       x      x          3.15
f2: obj-of(assume)         x                 5.43
f3: obj-of(avert)          x      x          5.88
f4: obj-of(ease)                  x          4.99
f5: obj-of(impose)         x      x          4.97
f6: adj-mod(fiduciary)     x                 7.76
f7: adj-mod(punitive)      x      x          7.10
f8: adj-mod(economic)             x          3.70

Let $F(w)$ be the set of features possessed by $w$. $F(w)$ can
be viewed as a description of the word $w$. The commonalities
between two words $w_1$ and $w_2$ are then $F(w_1) \cap F(w_2)$.

The similarity between two words is defined as follows:

(2) $$\mathrm{sim}(w_1, w_2) = \frac{2 \times I(F(w_1) \cap F(w_2))}{I(F(w_1)) + I(F(w_2))}$$

where $I(S)$ is the amount of information contained in a set
of features $S$. Assuming that features are independent of
one another, $I(S) = -\sum_{f \in S} \log P(f)$, where $P(f)$ is the
probability of feature $f$. When two words have identical
sets of features, their similarity reaches the maximum value
of 1. The minimum similarity 0 is reached when two words
do not have any common feature.

The probability $P(f)$ can be estimated by the percentage
of words that have feature $f$ among the set of words that
have the same part of speech. For example, there are 32868
unique nouns in a corpus, 1405 of which were used as
subjects of “include”. The probability of subj-of(include) is
1405/32868. The probability of the feature adj-mod(fiduciary) is
14/32868 because only 14 (unique) nouns were modified by
“fiduciary”. The amount of information in the feature
adj-mod(fiduciary), 7.76, is greater than the amount of
information in subj-of(include), 3.15. This agrees with our
intuition that saying that a word can be modified by “fiduciary”
is more informative than saying that the word can be the
subject of “include”.

The fourth column in Table 3 shows the amount of
information contained in each feature. If the features in Table 3
were all the features of “duty” and “sanction”, the similarity
between “duty” and “sanction” would be:

$$\frac{2 \times I(\{f_1, f_3, f_5, f_7\})}{I(\{f_1, f_2, f_3, f_5, f_6, f_7\}) + I(\{f_1, f_3, f_4, f_5, f_7, f_8\})}$$

which is equal to 0.66.

We parsed a 22-million-word corpus consisting of Wall
Street Journal and San Jose Mercury with a principle-based
broad-coverage parser, called PRINCIPAR [Lin, 1993;
Lin, 1994]. Parsing took about 72 hours on a Pentium
200 with 80MB memory. From these parse trees we
extracted about 14 million dependency triples. The frequency
counts of the dependency triples are stored and indexed in
a 62MB dependency database, which constitutes the set of
feature descriptions of all the words in the corpus. Using
this dependency database, we computed pairwise similarity
between 5230 nouns that occurred at least 50 times in the
corpus.

The words with similarity to “duty” greater than 0.04 are
listed in (3) in descending order of their similarity.


(3)  responsibility,  position,  sanction,  tariff,  obligation, 
fee,  post,  job,  role,  tax,  penalty,  condition,  function, 
assignment,  power,  expense,  task,  deadline,  training, 
work,  standard,  ban,  restriction,  authority, 
commitment,  award,  liability,  requirement,  staff, 
membership,  limit,  pledge,  right,  chore,  mission, 
care,  title,  capability,  patrol,  fine,  faith,  seat,  levy, 
violation,  load,  salary,  attitude,  bonus,  schedule, 
instruction,  rank,  purpose,  personnel,  worth, 
jurisdiction,  presidency,  exercise. 

The  following  is  the  entry  for  “duty”  in  the  Random  House 
Thesaurus  [Stein  and  Flexner,  1984]. 

(4) duty n. 1. obligation*, responsibility*; onus;
business, province; 2. function*, task*, assignment*,
charge. 3. tax*, tariff*, customs, excise, levy*.

The words marked with an asterisk in (4) also appear in (3).
It can be seen that our program captured all three senses of
“duty” in [Stein and Flexner, 1984].
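
The similarity between “duty” and “sanction” computed from Table 3, and the two feature-information values quoted in the text, can be checked with a short script. This is a sketch with illustrative names; it assumes natural logarithms, which is consistent with the values 3.15 and 7.76, and it assigns features to the two words as in the similarity expression for Table 3.

```python
import math

# Information content I(f) of each feature, from Table 3.
INFO = {"f1": 3.15, "f2": 5.43, "f3": 5.88, "f4": 4.99,
        "f5": 4.97, "f6": 7.76, "f7": 7.10, "f8": 3.70}
DUTY = {"f1", "f2", "f3", "f5", "f6", "f7"}
SANCTION = {"f1", "f3", "f4", "f5", "f7", "f8"}

def info(features):
    """I(S): sum of the information of the features in S (independence assumed)."""
    return sum(INFO[f] for f in features)

def word_sim(a, b):
    """2 * I(F(a) & F(b)) / (I(F(a)) + I(F(b)))."""
    return 2 * info(a & b) / (info(a) + info(b))

def feature_info(n_words_with_feature, n_words):
    """-log P(f), with P(f) estimated as a fraction of the words of that part of speech."""
    return -math.log(n_words_with_feature / n_words)

print(round(word_sim(DUTY, SANCTION), 2))      # 0.66
print(round(feature_info(1405, 32868), 2))     # 3.15, subj-of(include)
print(round(feature_info(14, 32868), 2))       # 7.76, adj-mod(fiduciary)
```

The shared features carry 21.10 units of information out of 34.29 for “duty” and 29.79 for “sanction”, which gives the ratio of 0.66.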

Two words are a pair of respective nearest neighbors
(RNNs) if each is the other’s most similar word. Our
program found 622 pairs of RNNs among the 5230 nouns that
occurred at least 50 times in the parsed corpus. Table 4
shows every 10th RNN.

Table 4: Respective Nearest Neighbors

Rank  RNN                          Sim
1     earnings  profit             0.50
11    revenue  sale                0.39
21    acquisition  merger          0.34
31    attorney  lawyer             0.32
41    data  information            0.30
51    amount  number               0.27
61    downturn  slump              0.26
71    there  way                   0.24
81    fear  worry                  0.23
91    jacket  shirt                0.22
101   film  movie                  0.21
111   felony  misdemeanor          0.21
121   importance  significance     0.20
131   reaction  response           0.19
141   heroin  marijuana            0.19
151   championship  tournament     0.18
161   consequence  implication     0.18
171   rape  robbery                0.17
181   dinner  lunch                0.17
191   turmoil  upheaval            0.17
201   biggest  largest             0.17
211   blaze  fire                  0.16
221   captive  westerner           0.16
231   imprisonment  probation      0.16
241   apparel  clothing            0.15
251   comment  elaboration         0.15
261   disadvantage  drawback       0.15
271   infringement  negligence     0.15
281   angler  fishermen            0.14
291   emission  pollution          0.14
301   granite  marble              0.14
311   gourmet  vegetarian          0.14
321   publicist  stockbroker       0.14
331   maternity  outpatient        0.13
341   artillery  warplanes         0.13
351   psychiatrist  psychologist   0.13
361   blunder  fiasco              0.13
371   door  window                 0.13
381   counseling  therapy          0.12
391   austerity  stimulus          0.12
401   ours  yours                  0.12
411   procurement  zoning          0.12
421   neither  none                0.12
431   briefcase  wallet            0.11
441   audition  rite               0.11
451   nylon  silk                  0.11
461   columnist  commentator       0.11
471   avalanche  raft              0.11
481   herb  olive                  0.11
491   distance  length             0.10
501   interruption  pause          0.10
511   ocean  sea                   0.10
521   flying  watching             0.10
531   ladder  spectrum             0.09
541   lotto  poker                 0.09
551   camping  skiing              0.09
561   lip  mouth                   0.09
571   mounting  reducing           0.09
581   pill  tablet                 0.08
591   choir  troupe                0.08
601   conservatism  nationalism    0.08
611   bone  flesh                  0.07
621   powder  spray                0.06
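
Finding respective nearest neighbors from pairwise similarities can be sketched as below. The function name and the input layout (a dictionary of dictionaries) are assumptions of this sketch; the 0.50 score for earnings and profit is from Table 4, while the scores involving "revenue" are hypothetical.

```python
def respective_nearest_neighbors(sim):
    """sim maps each word to {other_word: similarity}; returns the sorted RNN pairs."""
    # Most similar word for each word that has at least one scored neighbor.
    nearest = {w: max(others, key=others.get) for w, others in sim.items() if others}
    # Keep (w, n) only when n's nearest neighbor is w as well.
    pairs = {tuple(sorted((w, n))) for w, n in nearest.items() if nearest.get(n) == w}
    return sorted(pairs)

sim = {
    "earnings": {"profit": 0.50, "revenue": 0.30},   # 0.30 and 0.35 are hypothetical
    "profit":   {"earnings": 0.50, "revenue": 0.35},
    "revenue":  {"profit": 0.35, "earnings": 0.30},
}
print(respective_nearest_neighbors(sim))  # [('earnings', 'profit')]
```

"revenue" is closest to "profit", but "profit" is closest to "earnings", so only the mutual pair is reported.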

Some of the pairs may look peculiar. Detailed examination
actually reveals that they are quite reasonable. For example,
the pair ranked 221st is “captive” and “westerner”. It is
very unlikely that any manually created thesaurus will list
them as near-synonyms. We manually examined all 274
occurrences of “westerner” in the corpus and found that 55%
of them refer to westerners in captivity. Some of the bad
RNNs, such as (avalanche, raft) and (audition, rite), were due
to their relatively low frequencies,² which make them
susceptible to accidental commonalities, such as:


(5)  The  {avalanche,  raft}  {drifted,  hit}  .... 

To  {hold,  attend}  the  {audition,  rite}. 

An  uninhibited  {audition,  rite}. 

6  Semantic  Similarity  in  a  Taxonomy 


Semantic similarity [Resnik, 1995b] refers to similarity
between two concepts in a taxonomy such as WordNet
[Miller, 1990] or the CYC upper ontology. The semantic
similarity between two classes $C$ and $C'$ is not about the classes
themselves. When we say “rivers and ditches are similar”,
we are not comparing the set of rivers with the set
of ditches. Instead, we are comparing a generic river and
a generic ditch. Therefore, we define $\mathrm{sim}(C, C')$ to be the
similarity between $x$ and $x'$ if all we know about $x$ and $x'$
is that $x \in C$ and $x' \in C'$.

The two statements “$x \in C$” and “$x' \in C'$” are independent
(instead of being assumed to be independent) because
the selection of a generic $C$ is not related to the selection
of a generic $C'$. The amount of information contained in
“$x \in C$ and $x' \in C'$” is

$$-\log P(C) - \log P(C')$$

where $P(C)$ and $P(C')$ are probabilities that a randomly
selected object belongs to $C$ and $C'$, respectively.

Assuming that the taxonomy is a tree, if $x_1 \in C_1$ and
$x_2 \in C_2$, the commonality between $x_1$ and $x_2$ is
$x_1 \in C_0 \wedge x_2 \in C_0$, where $C_0$ is the most specific
class that subsumes both $C_1$ and $C_2$. Therefore,

$$\mathrm{sim}(x_1, x_2) = \frac{2 \times \log P(C_0)}{\log P(C_1) + \log P(C_2)}$$


² They all occurred 50-60 times in the parsed corpus.

entity 0.395
  inanimate-object 0.167
    natural-object 0.0163
      geological-formation 0.00176
        natural-elevation 0.000113
          hill 0.0000189
        shore 0.0000836
          coast 0.0000216

Figure 2: A Fragment of WordNet

For example, Figure 2 is a fragment of WordNet. The
number attached to each node $C$ is $P(C)$. The similarity
between the concepts of Hill and Coast is:

$$\mathrm{sim}(\mathrm{Hill}, \mathrm{Coast}) = \frac{2 \times \log P(\mathrm{Geological\text{-}Formation})}{\log P(\mathrm{Hill}) + \log P(\mathrm{Coast})}$$

which is equal to 0.59.
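
The Hill and Coast calculation can be reproduced on the Figure 2 fragment. This is a sketch with illustrative names; the parent links encode the tree as read from the figure (hill under natural-elevation, coast under shore, both under geological-formation), which should be treated as an assumption about the figure's layout.

```python
import math

# Parent links and node probabilities for the Figure 2 fragment.
PARENT = {
    "inanimate-object": "entity",
    "natural-object": "inanimate-object",
    "geological-formation": "natural-object",
    "natural-elevation": "geological-formation",
    "shore": "geological-formation",
    "hill": "natural-elevation",
    "coast": "shore",
}
P = {"entity": 0.395, "inanimate-object": 0.167, "natural-object": 0.0163,
     "geological-formation": 0.00176, "natural-elevation": 0.000113,
     "shore": 0.0000836, "hill": 0.0000189, "coast": 0.0000216}

def ancestors(c):
    """Path from c up to the root, starting with c itself."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def most_specific_subsumer(c1, c2):
    """First class on c2's path to the root that also subsumes c1."""
    seen = set(ancestors(c1))
    return next(c for c in ancestors(c2) if c in seen)

def sim(c1, c2):
    c0 = most_specific_subsumer(c1, c2)
    return 2 * math.log(P[c0]) / (math.log(P[c1]) + math.log(P[c2]))

print(most_specific_subsumer("hill", "coast"))  # geological-formation
print(round(sim("hill", "coast"), 2))           # 0.59
```

The same function gives 1 for a class compared with itself, since the most specific subsumer is then the class itself.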

There have been many proposals to use the distance
between two concepts in a taxonomy as the basis for their
similarity [Lee et al., 1989; Rada et al., 1989]. Resnik
[Resnik, 1995b] showed that the distance-based similarity
measures do not correlate with human judgments as
well as his measure. Resnik’s similarity measure is
quite close to the one proposed here:
$\mathrm{sim}_{\mathrm{Resnik}}(A, B) = \frac{1}{2} I(\mathrm{common}(A, B))$.
For example, in Figure 2,
$\mathrm{sim}_{\mathrm{Resnik}}(\mathrm{Hill}, \mathrm{Coast}) = -\log P(\mathrm{Geological\text{-}Formation})$.

Wu and Palmer [Wu and Palmer, 1994] proposed a measure
for semantic similarity that could be regarded as a special
case of sim(A, B):

$$\mathrm{sim}_{\mathrm{Wu\&Palmer}}(A, B) = \frac{2 \times N_3}{N_1 + N_2 + 2 \times N_3}$$

where $N_1$ and $N_2$ are the number of IS-A links from A and
B to their most specific common superclass C; $N_3$ is the
number of IS-A links from C to the root of the taxonomy.
For example, the most specific common superclass of Hill
and Coast is Geological-Formation. Thus, $N_1 = 2$, $N_2 = 2$,
$N_3 = 3$ and $\mathrm{sim}_{\mathrm{Wu\&Palmer}}(\mathrm{Hill}, \mathrm{Coast}) = 0.6$.

Interestingly, if $P(C \mid C')$ is the same for all pairs of
concepts such that there is an IS-A link from $C$ to $C'$ in the
taxonomy, $\mathrm{sim}_{\mathrm{Wu\&Palmer}}(A, B)$ coincides with sim(A, B).
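
The Wu and Palmer score and Resnik's score for Hill and Coast can be sketched on the same fragment. The parent links below are read from Figure 2 and the function names are illustrative; natural logarithms are assumed, so the Resnik value need not match Table 5 to every digit.

```python
import math

# Parent links for the Figure 2 fragment (assumed reading of the figure).
PARENT = {
    "inanimate-object": "entity",
    "natural-object": "inanimate-object",
    "geological-formation": "natural-object",
    "natural-elevation": "geological-formation",
    "shore": "geological-formation",
    "hill": "natural-elevation",
    "coast": "shore",
}

def path_to_root(c):
    """Classes from c up to the root, starting with c itself."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def wu_palmer(a, b):
    """2 * N3 / (N1 + N2 + 2 * N3), counting IS-A links."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(c for c in pa if c in set(pb))   # most specific common superclass
    n1, n2 = pa.index(common), pb.index(common)    # links from a and b up to it
    n3 = len(path_to_root(common)) - 1             # links from it up to the root
    return 2 * n3 / (n1 + n2 + 2 * n3)

def resnik(p_common_class):
    """-log P(C0), where C0 is the most specific common superclass."""
    return -math.log(p_common_class)

print(wu_palmer("hill", "coast"))      # 0.6
print(round(resnik(0.00176), 3))       # information in Geological-Formation
```

Unlike sim and the Wu and Palmer score, the Resnik score is not normalized, so it has no upper bound.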

Resnik [Resnik, 1995a] evaluated three different similarity
measures by correlating their similarity scores on 28
pairs of concepts in WordNet with assessments made
by human subjects [Miller and Charles, 1991]. We adopted
the same data set and evaluation methodology to compare
$\mathrm{sim}_{\mathrm{Resnik}}$, $\mathrm{sim}_{\mathrm{Wu\&Palmer}}$ and sim.
Table 5 shows the similarities between 28 pairs of concepts,
using three different similarity measures. Column
Miller & Charles lists the average similarity scores (on a scale
of 0 to 4) assigned by human subjects in Miller & Charles’s
experiments [Miller and Charles, 1991]. Our definition of
similarity yielded slightly higher correlation with human
judgments than the other two measures.

Table 5: Results of Comparison between Semantic Similarity Measures

Word Pair            Miller & Charles   Resnik   Wu & Palmer   sim
car, automobile      3.92               11.630   1.00          1.00
gem, jewel           3.84               15.634   1.00          1.00
journey, voyage      3.84               11.806   .91           .89
boy, lad             3.76               7.003    .90           .85
coast, shore         3.70               9.375    .90           .93
asylum, madhouse     3.61               13.517   .93           .97
magician, wizard     3.50               8.744    1.00          1.00
midday, noon         3.42               11.773   1.00          1.00
furnace, stove       3.11               2.246    .41           .18
food, fruit          3.08               1.703    .33           .24
bird, cock           3.05               8.202    .91           .83
bird, crane          2.97               8.202    .78           .67
tool, implement      2.95               6.136    .90           .80
brother, monk        2.82               1.722    .50           .16
crane, implement     1.68               3.263    .63           .39
lad, brother         1.66               1.722    .55           .20
journey, car         1.16               0        0             0
monk, oracle         1.10               1.722    .41           .14
food, rooster        0.89               .538     .7            .04
coast, hill          0.87               6.329    .63           .58
forest, graveyard    0.84               0        0             0
monk, slave          0.55               1.722    .55           .18
coast, forest        0.42               1.703    .33           .16
lad, wizard          0.42               1.722    .55           .20
chord, smile         0.13               2.947    .41           .20
glass, magician      0.11               .538     .11           .06
noon, string         0.08               0        0             0
rooster, voyage      0.08               0        0             0
Correlation with
Miller & Charles     1.00               0.795    0.803         0.834
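
The evaluation correlates each measure's scores with the human ratings. A minimal sketch of that step is given below, assuming the Pearson product-moment coefficient (the text does not name the coefficient). The two lists copy the Miller & Charles and sim columns of Table 5 as printed; because the printed values are rounded, the result may differ slightly from the reported figure.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Miller & Charles ratings and sim scores, row by row from Table 5.
human = [3.92, 3.84, 3.84, 3.76, 3.70, 3.61, 3.50, 3.42, 3.11, 3.08,
         3.05, 2.97, 2.95, 2.82, 1.68, 1.66, 1.16, 1.10, 0.89, 0.87,
         0.84, 0.55, 0.42, 0.42, 0.13, 0.11, 0.08, 0.08]
sim = [1.00, 1.00, 0.89, 0.85, 0.93, 0.97, 1.00, 1.00, 0.18, 0.24,
       0.83, 0.67, 0.80, 0.16, 0.39, 0.20, 0.00, 0.14, 0.04, 0.58,
       0.00, 0.18, 0.16, 0.20, 0.20, 0.06, 0.00, 0.00]

print(round(pearson(human, sim), 3))
```

The same function can be applied to the Resnik and Wu & Palmer columns to compare the three measures on equal terms.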


7  Comparison  between  Different  Similarity 
Measures 


One of the most commonly used similarity measures is
the Dice coefficient. Suppose two objects can be described
with two numerical vectors $(a_1, a_2, \ldots, a_n)$ and
$(b_1, b_2, \ldots, b_n)$; their Dice coefficient is defined as

$$\mathrm{sim}_{\mathrm{dice}}(A, B) = \frac{2 \times \sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2}$$

Another class of similarity measures is based on a distance
metric. Suppose dist(A, B) is a distance metric between
two objects; $\mathrm{sim}_{\mathrm{dist}}$ can be defined as follows:

$$\mathrm{sim}_{\mathrm{dist}}(A, B) = \frac{1}{1 + \mathrm{dist}(A, B)}$$

Table 6 summarizes the comparison among 5 similarity
measures.

Table 6: Comparison between Similarity Measures

Similarity Measures:  WP: sim_Wu&Palmer   R: sim_Resnik   Dice: sim_dice

Property                    sim   WP    R     Dice  sim_dist
increase with commonality   yes   yes   yes   yes   no
decrease with difference    yes   yes   no    yes   yes
triangle inequality         no    no    no    no    yes
Assumption 6                yes   yes   no    yes   no
max value = 1               yes   yes   no    yes   yes
semantic similarity         yes   yes   yes   no    yes
word similarity             yes   no    no    yes   yes
ordinal values              yes   no    no    no    no

Commonality and Difference: While most similarity
measures increase with commonality and decrease with
difference, $\mathrm{sim}_{\mathrm{dist}}$ only decreases with difference and
$\mathrm{sim}_{\mathrm{Resnik}}$ only takes commonality into account.

Triangle Inequality: A distance metric must satisfy the
triangle inequality:

$$\mathrm{dist}(A, C) \le \mathrm{dist}(A, B) + \mathrm{dist}(B, C).$$

Consequently, $\mathrm{sim}_{\mathrm{dist}}$ has the property that
$\mathrm{sim}_{\mathrm{dist}}(A, C)$ cannot be arbitrarily close to 0 if
neither $\mathrm{sim}_{\mathrm{dist}}(A, B)$ nor $\mathrm{sim}_{\mathrm{dist}}(B, C)$
is 0. This can be counter-intuitive in some
situations. For example, in Figure 3, A and B are similar in
their shades, B and C are similar in their shape, but A and
C are not similar.

Figure 3: Counter-example of Triangle Inequality

Assumption 6: The strongest assumption that we made in
Section 2 is Assumption 6. However, this assumption is
not unique to our proposal. Both $\mathrm{sim}_{\mathrm{Wu\&Palmer}}$ and
$\mathrm{sim}_{\mathrm{dice}}$ also satisfy Assumption 6. Suppose two objects
A and B are represented by two feature vectors
$(a_1, a_2, \ldots, a_n)$ and $(b_1, b_2, \ldots, b_n)$, respectively.
Without loss of generality, suppose the first $k$ features and
the remaining $n - k$ features represent two independent
perspectives of the objects.

$$\begin{aligned}
\mathrm{sim}_{\mathrm{dice}}(A, B) &= \frac{2 \times \sum_{i=1}^{n} a_i b_i}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2} \\
&= \frac{\sum_{i=1}^{k} a_i^2 + \sum_{i=1}^{k} b_i^2}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2} \times \frac{2 \times \sum_{i=1}^{k} a_i b_i}{\sum_{i=1}^{k} a_i^2 + \sum_{i=1}^{k} b_i^2} \\
&\quad + \frac{\sum_{i=k+1}^{n} a_i^2 + \sum_{i=k+1}^{n} b_i^2}{\sum_{i=1}^{n} a_i^2 + \sum_{i=1}^{n} b_i^2} \times \frac{2 \times \sum_{i=k+1}^{n} a_i b_i}{\sum_{i=k+1}^{n} a_i^2 + \sum_{i=k+1}^{n} b_i^2}
\end{aligned}$$

which is a weighted average of the similarity between A
and B in each of the two perspectives.

Maximum Similarity Values: With most similarity measures,
the maximum similarity is 1, except $\mathrm{sim}_{\mathrm{Resnik}}$, which
has no upper bound for similarity values.

Application Domains: The similarity measure proposed in
this paper can be applied in all the domains listed in Table
6, including the similarity of ordinal values, where none of
the other similarity measures is applicable.


8  Conclusion 

Similarity is an important and fundamental concept in AI
and many other fields. Previous proposals for similarity
measures are heuristic in nature and tied to a particular
domain or form of knowledge representation. In this paper,
we present a universal definition of similarity in terms of
information theory. The similarity measure is not directly
stated as in earlier definitions; rather, it is derived from a
set of assumptions. In other words, if one accepts the
assumptions, the similarity measure necessarily follows. The
universality of the definition is demonstrated by its
applications in different domains where different similarity
measures have been employed before.
Abstract


This paper defines a formal approach to learning
from examples described by labelled graphs. We
propose a formal model based on lattice theory,
and in particular on the Galois lattice.
We enlarge the domain of formal concept
analysis by using the Galois lattice model
with structural descriptions of examples and
concepts. Our implementation, called "Graal" (for
GRAph And Learning), constructs a Galois lattice
for any description language, provided that the
two operations of comparison and generalization
are defined for that language. We prove that
these operations exist in the case of labelled
graphs.


1.  INTRODUCTION 

The Galois lattice is the foundation of a set of conceptual
classification methods. This approach, defined by Barbut
and Monjardet (Barbut, 1970), was popularized by
(Wille, 1982), (Wille, 1992), who used this structure as
the kernel of formal concept analysis.

Wille proposed considering each node of a Galois lattice
as a formal concept. Each node has two parts: the
extension (a subset of the examples) and the intension (the
description). In addition, the lattice gives the relations
(generalization/specialization) between concepts. An
advantage of this formalization is a good description of
the concept space. Additionally, there are many methods
for the construction of such a lattice (depth-first search
(Bordat, 1986), incremental construction (Ganter, 1988),
(Missikoff, 1989)).

In the context of machine learning, the automatic
construction of such a hierarchy can be viewed as an
unsupervised conceptual classification method, as seen in
(Michalski, 1982), because we give a general method and
look for all the concepts that can be extracted from the
examples.


In this way, the search space is not limited by the use of
parameters, although this method cannot be used in all
practical applications. Its advantage is that we can study
precisely the impact of biases and heuristics.

An important limitation of the method using the Galois
lattice is the classical propositional description of the
examples (Wille, 1982), (Ganascia, 1993), (Mephu
Nguifo, 1994), (Carpineto, 1994). There is a great deal of
research on the extension of the description language:
valued attributes (Wille, 1989), (Carpineto, 1994), terms
(Daniel-Vatonne, 1993), graphs (Liquiere, 1989), (Godin,
1995).

In the case of structural description, current methods
use a two-step mechanism.

1) the goal of the first step is to find structures
repeated in the set of descriptions of the examples.

2) the second step uses the structures found and
changes the description of the example. Each structural
description is converted into a list of binary attributes (one
attribute per structure). An attribute is true if the associated
structure appears in the example.

• in our work (Liquiere, 1989), (Liquiere, 1994),
we used labelled graphs, and the goal of our first step was
to find repeated paths and trees in the description of the
examples.

• in the work of (Daniel-Vatonne, 1993), the
description language is based upon rooted trees (terms) and
the first step searches for paths.

• Godin and Mineau (Godin, 1995) use a similar
method with conceptual graphs.

The first step searches for repeated triplet graphs (graphs of the form
<Object>-relation-<Object>). This limits the complexity
of the search. The second step finds sets of triplet graphs
that appear in the same set of examples, but the links between
the nodes are overlooked. So the structural descriptions of
the examples are not exploited.

In  this  paper  we  give  a  general  one  step  mechanism, 
without  changing  the  description  of  the  examples.  This 
mechanism  uses  a  generalization  operation  and  we  specify 
this  operation  for  different  classes  of  description  languages. 
This paper is organized as follows.

A generalized method of learning from examples is
presented in section 2.

In section 3, we specify this model for description with
labelled graphs.

Then, in section 4, we study the complexity of the operations in the
case of descriptions with labelled graphs.

Finally, in section 5 we give an example of the results
found.

2. THE GALOIS LATTICE AS A
CORRESPONDENCE BETWEEN TWO
LATTICES

Using lattice theory, the formal framework is based on the
use of two lattices as in (Ganascia, 1993). This model
uses lattice results given in (Birkhoff, 1967) and (Barbut,
1970). This approach was used in machine learning
(Ganascia, 1993), but only for propositional descriptions.

Ganascia writes: "this framework is adequate to represent
classical top-down induction systems .. but it is too
restricted to formalize first order logic languages ..."

In fact, this approach can be used for structured
descriptions as well. Thus there is a unifying method for
many types of description language.

2.1 Two isomorphic lattices

This  formalization  is  based  on  the  use  of  two  lattices:  the 
description  lattice  D  and  the  instance  lattice  I. 

• The instance lattice I corresponds to the power set of
the training set $\mathcal{E}$ and is ordered by the inclusion
relation, denoted $\subseteq$, where $A \subseteq B$ means that A
is included in B. Given two elements a and b, the least
upper bound and the greatest lower bound correspond
to the classical union ($a \cup b$) and intersection ($a \cap b$).

• The description lattice D contains all the
possible descriptions ordered by a relation $\geq$. This
relation $\geq$ corresponds to the generalization relationship.

$\geq : D \times D \to \{true, false\}$; for two descriptions
$d_1, d_2 \in D$, $d_1 \geq d_2$ means that $d_1$ is more general than
$d_2$. Let us just consider that it structures the description
space with a partial ordering.

From this relation we can define the equivalence
relation $\equiv : D \times D \to \{true, false\}$, $d_1 \equiv d_2$ iff ($d_1 \geq d_2$)
and ($d_2 \geq d_1$).

Because D is a lattice, two elements of D have a
least upper bound. We denote this bound by $d_1 \wedge d_2$. This is
the least general generalization of $d_1$ and $d_2$.

We have $\wedge : D \times D \to D$; if $d \geq d_1$ and $d \geq d_2$ then
$d \geq d_1 \wedge d_2$.

This is a generalization operator (Plotkin, 1971), as
defined in (Muggleton, 1994). For a set of descriptions
S, "A minimal generalization G of S is a generalization of
S such that S is not a generalization of G, and there is no
generalization G' of S such that G is a generalization of
G'."

2.2  Galois  lattice. 

Let us begin by building two correspondences between the
lattices I and D.

First there is a mapping d between the set $\mathcal{E}$ and the description
space D: $d : \mathcal{E} \to D$; for $e_i \in \mathcal{E}$, $d(e_i) \in D$ is the description of
the example $e_i$.

For example:

• with a propositional description, $d(e_i)$ is a list of
attributes.

• in the case of structural description, $d(e_i)$ can be a graph.

Now,  from  this  simple  description  mapping,  we  can  build 
two  correspondences  between  I  and  D. 

The correspondence $\alpha : D \to I$ associates with each
description d of D the set of all instances of the training set
$\mathcal{E}$ which are covered by d.

$\alpha(d) = \{e_i \in \mathcal{E} \mid d \geq d(e_i)\}$

Properties 1

1) $d \geq d' \Leftrightarrow \alpha(d) \supseteq \alpha(d')$

2) $\alpha(d_1 \wedge d_2) = \alpha(d_1) \cup \alpha(d_2)$

Proof in appendix

The correspondence $\beta : I \to D$ is equivalent to
making the least general generalization of the descriptions
of all the elements of $H \subseteq \mathcal{E}$. This means that:

$\beta(H) = \wedge_{e \in H}\, d(e)$

Theorem 1 The correspondences $\alpha$ and $\beta$ define a Galois
connection between I and D.

Proof  in  appendix,  see  also  (Ganascia,  1993). 

Now  we  have  a  generalization  of  the  classical  definition  of 
concept. 

For a set of examples $\mathcal{E}$, a description space D, and an
instance space I, a concept C is a pair [Ext × Int] with:

• $Int \in D \mid Int = \beta(Ext) = \wedge_{e \in Ext}\, d(e)$

• $Ext \in I \mid Ext = \alpha(Int) = \{e_i \in \mathcal{E} \mid Int \geq d(e_i)\}$.

All the concepts are ordered by the superconcept-
subconcept (generalization-specialization) relation $\geq_g$.

$[E_1, I_1] \geq_g [E_2, I_2]$ iff $E_1 \supseteq E_2$ and $I_1 \geq I_2$

With $\geq_g$, the set of all concepts has the mathematical
structure of a complete lattice and is called the Galois
Lattice of the context $(\mathcal{E}, d, D)$.
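The two correspondences can be illustrated on the propositional case. This is a minimal sketch, assuming that each description is a set of attributes, that "more general" means "requires a subset of the attributes", and that the least general generalization is set intersection; the toy context and all function names are hypothetical.

```python
from functools import reduce

# Hypothetical toy context: the mapping d from examples to descriptions.
EXAMPLES = {
    "e1": frozenset({"big", "blue"}),
    "e2": frozenset({"big", "red"}),
    "e3": frozenset({"small", "blue"}),
}


def more_general(d1, d2):
    """d1 >= d2: d1 is more general because it requires fewer attributes."""
    return d1 <= d2


def generalize(d1, d2):
    """Least general generalization of two attribute sets: shared attributes."""
    return d1 & d2


def alpha(description):
    """Extension: every example covered by the description."""
    return frozenset(e for e, d in EXAMPLES.items() if more_general(description, d))


def beta(examples):
    """Intension: generalization of the descriptions of a non-empty example set."""
    return reduce(generalize, (EXAMPLES[e] for e in examples))


def concept_of(examples):
    """Close a non-empty set of examples into a concept (Ext, Int)."""
    intent = beta(examples)
    return alpha(intent), intent


big_things = concept_of({"e1", "e2"})
everything = concept_of({"e2", "e3"})
```

In this sketch, closing {e2, e3} yields the empty description, whose extension is the whole training set, which shows that a set of examples is always included in the extension of its own generalization.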
3. DEFINITION OF THE ORDER ($\geq$) AND
GENERALIZATION OPERATION ($\wedge$)
FOR LABELLED GRAPHS

In section 2, we proposed a formal model. In this model
we defined two basic operations, $\geq$ and $\wedge$. If these
operations satisfy the required properties (order, generalization
operator), then the concept space is a Galois lattice.

Our goal is to use this model for structural descriptions,
more precisely for graph descriptions.

In order to demonstrate this we must first define the
operations $\geq$ and $\wedge$, and prove that the description space D is
a lattice.

3.1 $\geq$ Definition for labelled graphs

In this paragraph, we define a pre-order between graphs
using the homomorphism relation. We will show (3.2)
that for a class of graphs (core graphs), this pre-order is an
order.

Notations 

We denote a graph by G:(V,E,L).

The vertex set of G is denoted by V(G).

The edge set of G is denoted by E(G). Each edge is an
ordered pair $(v_1, v_2)$, $v_1, v_2 \in V(G)$.

The label set of G is denoted by L(G). For a vertex v we
denote by L(v) the label of this vertex.

In the following paragraphs, we give properties for
directed graphs; these properties hold for
undirected graphs as well.

Definition labelled graph homomorphism
A homomorphism $f : G_1 \to G_2$ is a mapping
$f : V(G_1) \to V(G_2)$ for which $(f(v_1), f(v_2)) \in E(G_2)$
whenever $(v_1, v_2) \in E(G_1)$, and which preserves labels:
$L(f(v)) = L(v)$ for each $v \in V(G_1)$.


Figure 1: Homomorphism example $G_1 \to G_2$

This is not the classical subgraph isomorphism relation.

Operation $\geq : D \times D \to \{True, False\}$

For two labelled graphs $G_1:(V_1,E_1,L_1)$ and
$G_2:(V_2,E_2,L_2)$, we write $G_1 \geq G_2$ iff there is a
homomorphism from $G_1$ into $G_2$.
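The order test can be sketched by brute force. This is a minimal illustration, assuming a labelled directed graph is stored as a pair (labels, edges), where labels maps each vertex to its label and edges is a set of ordered vertex pairs; this encoding and the function names are assumptions, not the paper's implementation, and the search is exponential in the worst case.

```python
from itertools import product


def homomorphisms(g1, g2):
    """Yield every label-preserving homomorphism from g1 to g2 as a dict."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    vertices = list(labels1)
    # A vertex may only be mapped to a vertex of g2 carrying the same label.
    candidates = [
        [w for w in labels2 if labels2[w] == labels1[v]] for v in vertices
    ]
    for images in product(*candidates):
        f = dict(zip(vertices, images))
        # Every edge of g1 must be sent to an edge of g2.
        if all((f[u], f[v]) in edges2 for u, v in edges1):
            yield f


def more_general(g1, g2):
    """g1 >= g2 iff there is a homomorphism from g1 into g2."""
    return next(homomorphisms(g1, g2), None) is not None


# Hypothetical graphs: an edge a -> b, a path a -> b -> c, and a fork a -> b, a -> b.
G1 = ({"x": "a", "y": "b"}, {("x", "y")})
G2 = ({1: "a", 2: "b", 3: "c"}, {(1, 2), (2, 3)})
G3 = ({"p": "a", "q": "b", "r": "b"}, {("p", "q"), ("p", "r")})
```

Here G1 is more general than G2 but not the converse, while G1 and G3 map into each other even though they are not isomorphic, which is why the relation is only a pre-order.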


Operation $\equiv : D \times D \to \{True, False\}$

Two labelled graphs $G_1$ and $G_2$ are homomorphically
equivalent, denoted by $G_1 \equiv G_2$, if both $G_1 \geq G_2$ and
$G_2 \geq G_1$.

Figure 2: $G_1 \equiv G_2$

Operation $\not\equiv : D \times D \to \{True, False\}$; $d_1 \not\equiv d_2$ iff not ($d_1 \equiv d_2$)

3.2  D  for  labelled  graphs 

The homomorphism relation is only a pre-order because
the antisymmetry property is not fulfilled (Chein, 1992).
An order relation between elements of D is necessary in
order to use the results of section 2.

The same problem occurs in Inductive Logic
Programming (Muggleton, 1994):

"Because two clauses equivalent under θ-subsumption are
also logically equivalent (implication), ILP systems
should generate at most one clause of each equivalence
class. To get around this problem, Plotkin defined
equivalence classes of clauses, and showed that there is a
unique representative of each clause, which he named 'the
reduced clause'."

In  the  case  of  labelled  graphs,  we  can  use  the  same 
strategy.  For  this  purpose,  we  use  the  class  of  core 
labelled  graphs  (Zhou,  1991). 

Definition  retract 

A strict subgraph G' of G is a retract of G (Zhou, 1991)
if there is a homomorphism, called a retraction, $r : G \to G'$
such that $r(v) = v$ for each $v \in V(G')$.

Definition  core 

A  graph  is  called  a  core  (or  minimal  graph  (Fellner, 
1982),  or  irredundant  graph  (Cogis,  1995))  if  it  has  no 
proper  retracts. 

Property  2 

For the equivalence relation defined above ($\equiv$), an
equivalence class of labelled graphs contains one and only
one core labelled graph, which is the (unique) graph with
the smallest number of vertices (Mugnier, 1994).

Notation  R: 

We  can  construct  a  core  graph  from  a  graph  as  proved  by 
Mugnier  (Mugnier,  1994).  This  operation  is  called 
reduction  (notation  R). 

Let g be a labelled graph; R(g) is a core labelled graph
such that $g \equiv R(g)$.
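A naive way to compute a reduction is to shrink the graph along non-surjective endomorphisms until none remains. This is a minimal brute-force sketch, not the construction of (Mugnier, 1994); it assumes labels is a dict from vertex to label and edges is a set of ordered vertex pairs, and it takes exponential time in general.

```python
from itertools import product


def endomorphisms(labels, edges):
    """Yield every label-preserving homomorphism from the graph to itself."""
    vertices = list(labels)
    candidates = [
        [w for w in vertices if labels[w] == labels[v]] for v in vertices
    ]
    for images in product(*candidates):
        f = dict(zip(vertices, images))
        if all((f[u], f[v]) in edges for u, v in edges):
            yield f


def reduce_to_core(labels, edges):
    """Return an equivalent core by repeatedly keeping the image of a
    non-surjective endomorphism (the induced subgraph on that image)."""
    labels, edges = dict(labels), set(edges)
    shrunk = True
    while shrunk:
        shrunk = False
        for f in endomorphisms(labels, edges):
            image = set(f.values())
            if len(image) < len(labels):
                labels = {v: labels[v] for v in image}
                edges = {(u, v) for u, v in edges if u in image and v in image}
                shrunk = True
                break
    return labels, edges


# Hypothetical fork a -> b, a -> b: one of the two b vertices is redundant.
core_labels, core_edges = reduce_to_core(
    {"p": "a", "q": "b", "r": "b"}, {("p", "q"), ("p", "r")}
)
```

The loop stops when every endomorphism is surjective, which for a finite graph means that no proper retract exists.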
Figure 3: Example G and R(G)

We need an order relation in order to use labelled graphs.
For core labelled graphs, we have this order. So, we define
D as a set of core labelled graphs. Every labelled graph
description of an example can be converted to an
equivalent core labelled graph, using the R operation.




Theorem 2 The restriction of $\geq$ to the set of core labelled
graphs is a lattice (Zhou, 1991), (Poole, 1993), (Chein,
1994).

For this lattice, the $\vee$ operation is the disjoint sum of the
graphs, $g_1 \vee g_2 = g_1 + g_2$ (Chein, 1994) ($g_1$ with $g_2$ form a
new graph).

The $\wedge$ operation is more complex and is defined in the
following paragraph.

3.3 $\wedge$ Definition for labelled graphs

The $\wedge$ operation for graphs is based on the
classical Kronecker product operation $\times$ (Weichsel,
1962).


Figure  4:  Labelled  graphs  product 

Lemma 1

If $G_1$, $G_2$, $G$ are labelled graphs then

a) $G_1 \times G_2 \geq G_1$ and $G_1 \times G_2 \geq G_2$

b) if $G \geq G_1$ and $G \geq G_2$, then $G \geq G_1 \times G_2$

c) $G_1 \geq G_1 \times G_2$ if and only if $G_1 \geq G_2$.

Proof: from the definitions and (Zhou, 1991)

Remark: The product operation can easily be extended
when the label set is a hierarchy or a lattice.


Definition $\times$ operation for graphs
For two graphs, the product $G_1 \times G_2$ has the vertex set
$V(G_1) \times V(G_2)$ and the edges $((v_1, v_2), (v'_1, v'_2))$, where
$(v_1, v'_1) \in E(G_1)$ and $(v_2, v'_2) \in E(G_2)$.

This product operation can be defined for labelled
graphs.

Definition $\times$ operation for labelled graphs

For two labelled graphs $G_1:(V_1,E_1,L_1)$ and
$G_2:(V_2,E_2,L_2)$,
the product $G(V,E,L) = G_1 \times G_2$ is defined by:

• $L = L_1 \cap L_2$

• $V \subseteq V_1 \times V_2 = \{v \mid v = [v_1, v_2]$ with $L(v_1) = L(v_2)$ and $L(v) = L(v_1)\}$

• $E = \{(v = [v_1, v_2], v' = [v'_1, v'_2]) \mid (v_1, v'_1) \in E_1$ and
$(v_2, v'_2) \in E_2\}$ (edges are oriented)
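The labelled product can be sketched directly from this definition. The following is a minimal illustration, assuming a graph is a pair (labels, edges) with labels a dict from vertex to label and edges a set of ordered vertex pairs; the encoding, the function name, and the sample graphs are hypothetical.

```python
def labelled_product(g1, g2):
    """Kronecker product restricted to pairs of vertices with equal labels."""
    labels1, edges1 = g1
    labels2, edges2 = g2
    # Product vertices: pairs (v1, v2) whose labels agree; the pair keeps that label.
    labels = {
        (v1, v2): l1
        for v1, l1 in labels1.items()
        for v2, l2 in labels2.items()
        if l1 == l2
    }
    # Product edges: one edge of g1 paired with one edge of g2, both endpoints kept.
    edges = {
        ((u1, u2), (v1, v2))
        for (u1, v1) in edges1
        for (u2, v2) in edges2
        if (u1, u2) in labels and (v1, v2) in labels
    }
    return labels, edges


# Hypothetical graphs: a -> b, a -> c on one side, a -> b, a -> b on the other.
GA = ({"x": "a", "y": "b", "z": "c"}, {("x", "y"), ("x", "z")})
GB = ({1: "a", 2: "b", 3: "b"}, {(1, 2), (1, 3)})
prod_labels, prod_edges = labelled_product(GA, GB)
```

In this example the vertex labelled c disappears because it has no counterpart in the second graph, and the product keeps only the structure common to both graphs (here with a duplicated b vertex that a reduction step would then remove).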


Definition operation $\wedge : D \times D \to D$

For $G_1$, $G_2$ two core labelled graphs, $G_1 \wedge G_2 = R(G_1 \times G_2)$


Figure 5: $G_1 \wedge G_2$ with $G_1$ and $G_2$ defined in Figure 4

3.4  Galois  lattice  for  graphs 

Now we have all the operations needed for the construction of a
Galois lattice when examples are described by graphs.

Each node of this Galois lattice is a pair [Ext × Int], where
Ext is a subset of $\mathcal{E}$ and Int is a core graph. This core
graph is the generalization of the descriptions of the
examples in Ext.
Theorem 3
With:

$\mathcal{E}$ a set of examples,

D a description lattice for core labelled graphs,

d a description mapping, $d : \mathcal{E} \to D$,

I the instance lattice, the power set of $\mathcal{E}$,

$\geq$ an order relation $D \times D \to \{True, False\}$,

$\wedge$ the generalization operation $D \times D \to D$ for graphs.

We can define $\alpha$, $\beta$:

for $g \in D$, $\alpha(g) = \{e_i \in \mathcal{E} \mid g \geq d(e_i)\}$

for $H \in I$, $\beta(H) = \wedge_{e \in H}\, d(e)$

The correspondences $\alpha$ and $\beta$ define a Galois connection
between I and D.

Proof: see Lemma 1 and the proof of Theorem 1.

Using  this  Galois  connection,  we  can  define  a  Galois 
lattice  (see  2.2).  We  name  this  lattice  T. 


We have defined a formal model for labelled
graphs. This model uses the $\geq$ and $\wedge$ operations in the case
of labelled graphs. In the next section, we study the
complexity of these operations.

4.  THE  COMPLEXITY  OF  GALOIS 
LATTICE  CONSTRUCTION 

The complexity of the construction of a Galois lattice in
our model is a function of:

1) the number of nodes in the lattice,

2) the time and space complexity of the operations ($\wedge$, R),

3) the algorithm used for Galois lattice construction (see
5.1).

4.1 The size of the Galois lattice
Property 3: The number of nodes of T can be $2^{|\mathcal{E}|}$.

Proof. It is well known that a Galois lattice can be
isomorphic to the power set of $\mathcal{E}$ (I), which is the maximal
size of T.

A similar situation occurs in our model. The proof comes
from the fact that each binary attribute description can be
converted to a structural description.

For example the list [Big] [Blue] [Expensive] can be
structurally described as:


Figure 6: graph representation for a list of attributes.


Using this description, the $\wedge$ operation for the tree
representation is equivalent to $\cap$ for the attribute
representation.

4.2 Complexity of the $\wedge$ operation

In  (Muggleton,  1994)  S.  Muggleton  and  L.  de  Raedt 
wrote: 

"...  ILP  systems  can  get  around  the  problem  of  equivalent 
clauses  when  working  with  reduced  clauses  only". 

This statement is true, but the problem of the complexity
of the R operator has not been taken into account.

• for two labelled graphs, $G_1=(V_1,E_1,L_1)$ and
$G_2=(V_2,E_2,L_2)$, the complexity of the product is $O(n_1 \times n_2)$,
where $n_1 = |V_1|$ and $n_2 = |V_2|$.

For a set of graphs P,
$G = \wedge_{G_i \in P}\, G_i = R(\times_{G_i \in P}\, G_i)$;
the size of $\times_{G_i \in P}\, G_i$ can be exponential.

Property 4

The operation R is co-NP-complete (Mugnier, 1994). So,
in general applications, this operation cannot be used.

However  we  do  have  an  interesting  result: 

Property  5  (Mugnier,  1994) 

If,  for  a  class  of  labelled  graphs,  the  homomorphism  is 
polynomial,  then  the  reduction  operation  is  polynomial. 

The homomorphism test for the following classes of labelled
graphs is polynomial:

• trees (Mugnier, 1994),

• locally injective graphs (Liquiere, 1994) (see definition
below),

• 1/2 locally injective graphs (see definition below) (see
language theory, automata (Aho, 1986)).

Property 6

For a set of paths or trees P, $G = \wedge_{G_i \in P}\, G_i$ is polynomial
(time and size) (Horvath, 1995).

4.3  Study  for  a  class  of  Graphs. 

We study the complexity of the operations ($\wedge$, R) for the
class of locally injective graphs (LIG) (Liquiere, 1994).

Notation

We write $N^+(v) = \{v' \mid (v, v') \in E\}$ and $N^-(v) = \{v' \mid
(v', v) \in E\}$.
Definition LIG graph

For a labelled graph G=(V,E,L), G is locally injective if
for each vertex $v \in V$, $\forall v_1, v_2 \in N^+(v)$, $v_1 \neq v_2 \Rightarrow L(v_1) \neq
L(v_2)$ and $\forall v_1, v_2 \in N^-(v)$, $v_1 \neq v_2 \Rightarrow L(v_1) \neq L(v_2)$.


Figure 7: LIG Property

In figure 7, $G_1$ is an LIG graph. $G_2$ is not an LIG graph
because, for the C node, there are two edges C -> b. $G_2$ is a
1/2 locally injective graph (see the next definition).

Definition 1/2 locally injective graph
A 1/2 locally injective graph is an oriented graph where
$\forall v_1, v_2 \in N^+(v)$ (resp. $N^-(v)$), $v_1 \neq v_2 \Rightarrow L(v_1) \neq L(v_2)$.
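Both definitions can be tested in one pass over the edges. This is a minimal sketch, assuming labels is a dict from vertex to label and edges is a set of ordered vertex pairs; reading "1/2 locally injective" as "injective on successors or on predecessors" is an interpretation of the definition above, and the function names and sample graphs are hypothetical.

```python
def _injective(labels, edges, outgoing):
    """True if no vertex has two distinct neighbours with the same label,
    looking at successors (outgoing=True) or predecessors (outgoing=False)."""
    seen = set()
    for u, v in edges:
        centre, neighbour = (u, v) if outgoing else (v, u)
        key = (centre, labels[neighbour])
        if key in seen:
            return False
        seen.add(key)
    return True


def is_lig(labels, edges):
    """Locally injective: injective on both successors and predecessors."""
    return _injective(labels, edges, True) and _injective(labels, edges, False)


def is_half_lig(labels, edges):
    """1/2 locally injective (assumed reading): injective in one direction."""
    return _injective(labels, edges, True) or _injective(labels, edges, False)


# Hypothetical graphs in the spirit of Figure 7.
FIG_G1 = ({"c": "C", "x": "a", "y": "b"}, {("c", "x"), ("c", "y")})
FIG_G2 = ({"c": "C", "x": "b", "y": "b"}, {("c", "x"), ("c", "y")})
```

Because edges form a set, two edges leaving the same vertex always reach distinct neighbours, so a repeated (vertex, label) key is exactly a violation of the definition.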

Property 7: $\geq$ is polynomial for locally injective graphs
(Liquiere, 1994) and for 1/2 locally injective graphs (Aho,
1986).

Property 8: For $G_1$, $G_2$ two LIG, $G = G_1 \times G_2$ is a LIG.
Partial proof: this comes from the definition of $\times$.

Property 9: A connected LIG is an irredundant graph
(Cogis, 1995).

Property 10: For G a LIG, we denote by CC(G) the set of
maximal connected subgraphs of G. Then $R(G) =
\{c_i \in CC(G) \mid \forall j \neq i$, there is no projection from $c_i$ to $c_j\}$

Proof: Property 9 $\Rightarrow$ Property 10

These properties are interesting because for LIG we can
construct the R, $\geq$, $\times$ and $\wedge$ operations, for two graphs,
with a polynomial complexity.

Property 11: For a set of 1/2 locally injective graphs P,
$G = \wedge_{G_i \in P}\, G_i$ is of exponential size and therefore takes exponential time
(results for deterministic automata (Aho, 1986)).


Table 1: Complexity for different classes of language

Language   $\geq$, R   Size for n graphs   Reference
Path       P           Polynomial          (Horvath, 1995)
Tree       P           Polynomial          (Horvath, 1995)
LIG        P           ?                   (Liquiere, 1994)
1/2 LIG    P           Exponential         (Aho, 1986)
Graph      NPC         Exponential         (Garey, 1987)

With P: polynomial, NPC: NP-complete.

5.  GRAAL  IMPLEMENTATION 

Traditionally machine learning offers mechanisms for a
class of languages. The idea is that if an algorithm is good for
a general class of languages, it will also work well for a
less general class included in the first one. This is true, but
in many cases the general mechanism does not use all the
interesting properties of the restricted language. So the
complexity of the operation is not optimal for this
language.

A second drawback of this approach comes from the need
for a translation process. Each description in the restricted
language has to be converted into a more general one. For
example a list of attributes is converted into a graph
(Liquiere, 1994).

In our new method, Graal (for GRAph And Learning), we
have implemented a general mechanism where the description
language and the operations $\wedge$, $\geq$ are parameters. Our tool is
generic but it cannot yet be used in practical cases when
large sets of examples are described by large graphs.
It is, in fact, an algorithm for formal analysis.

5.1 Use of a classical algorithm for
Galois lattice construction

We give an algorithm which can be used on any
description language with operations $\geq$ and $\wedge$.

This algorithm is based on a classical method (Chein,
1969). Another algorithm can be used (Bordat, 1986)
which gives the set of nodes of the Galois lattice and also
the set of edges.

We denote by $[e_i \times d_i]$ the concept numbered i of $T^k$.
T ← ∅   /* concept set empty */

/* description of the examples */
T^1 ← {[{i} × d(e_i)]} with i ∈ [1, |E|]   /* E is the set of examples */
k ← 1

While |T^k| > 1 do

    For each i < j with i, j ∈ [1, |T^k|] (so we have [e_i × d_i], [e_j × d_j])
        /* we create a new concept from two concepts
           already found in the previous step */

        /* description for the new concept */
        [e_ij × d_ij] ← [e_i ∪ e_j × d_i ∧ d_j]

        if d_ij ≠ ∅ then

            /* test if there is a concept with the same description */
            if d_ij ∈ T^(k+1) then
                e_ij ← e_ij ∪ e_j
            else
                T^(k+1) ← T^(k+1) ∪ {[e_ij × d_ij]}

            /* test if the description is a generalisation */
            if d_ij ≡ d_i then
                T^k ← T^k − [e_i × d_i]
            if d_ij ≡ d_j then
                T^k ← T^k − [e_j × d_j]

        End-if

    end-for
    T ← T ∪ T^k
    k ← k+1
end-while
T ← T ∪ T^k
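The idea of a construction parameterised by the two operations can be sketched as follows. This is a naive fixpoint computation, not the level-wise algorithm given above nor Graal's Java implementation: it closes the set of example descriptions under the generalization operation and then computes each extension. It assumes descriptions are hashable and canonical (two equivalent descriptions are equal), and all names and the toy data are hypothetical.

```python
def galois_lattice(descriptions, generalize, more_general):
    """Return a dict mapping each intension to its extension.

    descriptions: dict from example name to its description.
    generalize:   the generalization operation on two descriptions.
    more_general: the order test, more_general(d1, d2) meaning d1 >= d2.
    """
    intents = set(descriptions.values())
    frontier = set(intents)
    while frontier:
        new = set()
        for d1 in frontier:
            for d2 in intents:
                g = generalize(d1, d2)
                if g not in intents:
                    new.add(g)
        intents |= new
        frontier = new
    return {
        d: frozenset(e for e, desc in descriptions.items() if more_general(d, desc))
        for d in intents
    }


# Hypothetical propositional instance: descriptions are attribute sets,
# generalization is intersection, "more general" is set inclusion.
toy = {
    "e1": frozenset({"big", "blue"}),
    "e2": frozenset({"big", "red"}),
    "e3": frozenset({"small", "blue"}),
}
lattice = galois_lattice(
    toy,
    generalize=lambda a, b: a & b,
    more_general=lambda a, b: a <= b,
)
```

Swapping in a graph product followed by reduction for `generalize`, and a homomorphism test for `more_general`, would give the structural case, at the cost discussed in section 4. The sketch only produces concepts with a non-empty extension.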

Graal is written in Java and uses object-oriented
programming. We have defined an abstract
class (interface), so users can add their own description
language by implementing the interface.

The complexity of Galois lattice construction with
Bordat's algorithm (Bordat, 1986) is less than $O(n^2 \cdot p)$,
where n is the number of objects and p the size of T.

5.2  An  experimental  example. 

We  present  an  example  where  each  object  is  described  by 
a  locally  injective  labelled  graph. 


We  use  a  classical  example  based  on  arch  definition. 


Figure 8: set of examples


The  lattice  is: 


Figure 9: the structure of the Galois lattice for our set of
examples.

For each node of the lattice there is a pair consisting of a
graph and a set of examples. Additionally, if $node_j, \ldots, node_k$
are linked to $node_p$, then $node_p$ is the least
common superconcept (generalization) of $node_j, \ldots, node_k$.

In figure 9 we observe the subsets of examples (extensions).
In our tool, double-clicking on a node gives the
following descriptions.


312  Liquiere  and  Sallantin 


Figure 10: the graph and subset for each node.

This lattice gives all the classifications of the examples
without duplication. All concepts are different
(description and extension); two different descriptions
necessarily have distinct extensions.

Remark: For this example, unconnected nodes such as "on"
can be interpreted as: there is something on something.

6.  CONCLUSION 

Our work enlarges the domain of formal
concept analysis by demonstrating that the Galois lattice
can be used for structural description.

Coming  from  graph  theory,  our  work  provides  operations 
and  shows  that  they  can  be  used  to  build  a  generalization 
operator  for  labelled  graphs. 

In  addition,  the  LIG  graphs  we  use  are  an  excellent 
compromise  between  complexity  and  expressiveness. 


Our method, written in Java, offers a general tool for
formal structural concept analysis.

We are now working on the following improvements:

• determining whether LIG is PAC learnable,

• a survey of classical Galois lattice results in the
case of structural concept description,

• an implementation of heuristics in Graal, to
make Graal a tool for practical applications,

• an improvement of the approach with
categorical operations.
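Appendix

Properties 1 proof

Proof 1)

$g \geq g' \Leftrightarrow \alpha(g) \supseteq \alpha(g')$

$\Rightarrow$: property of the order relation $\geq$

$\Leftarrow$: definition of $\alpha$

Proof 2)

$\alpha(g_1 \wedge g_2) = \alpha(g_1) \cup \alpha(g_2)$

$\alpha(g) = \{e \in \mathcal{E} \mid g \geq d(e)\}$, where $\mathcal{E}$ is the set of examples.

We have $g_1 \wedge g_2 \geq g_1$ and $g_1 \wedge g_2 \geq g_2$, so $\alpha(g_1 \wedge g_2)
\supseteq \alpha(g_1) \cup \alpha(g_2)$.

We know that for $g_1 \wedge g_2$, $\forall g$, if $g \geq g_1$ and $g \geq g_2$ then $g \geq
g_1 \wedge g_2$, so $\alpha(g) \supseteq \alpha(g_1 \wedge g_2) \supseteq \alpha(g_1) \cup \alpha(g_2)$.
If $\alpha(g_1 \wedge g_2) \supset \alpha(g_1) \cup \alpha(g_2)$, then for $g'$ with $\alpha(g') = \alpha(g_1)
\cup \alpha(g_2)$ we have
$g_1 \wedge g_2 > g'$ and $g' \geq g_1$ and $g' \geq g_2$, so we do not have
the property of $g_1 \wedge g_2$;

then $\alpha(g_1 \wedge g_2) = \alpha(g_1) \cup \alpha(g_2)$.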

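Theorem 1 proof

$\alpha$ and $\beta$ form a Galois connection iff:

a) $\forall I_1, I_2 \in I$, $I_1 \subseteq I_2 \Rightarrow \beta(I_1) \leq \beta(I_2)$

b) $\forall g_1, g_2 \in D$, $g_1 \geq g_2 \Rightarrow \alpha(g_1) \supseteq \alpha(g_2)$

and for $h = \alpha \circ \beta$ and $h' = \beta \circ \alpha$

c) $\forall H \in I$, $H \subseteq h(H)$

d) $\forall g \in D$, $g \geq h'(g)$ (remark: classically, generalization is
denoted $\leq$, which gives a more classical definition).

Proof a) We have $g_1 \wedge g_2 \geq g_1$; then
$\beta(I_1) = \wedge_{e \in I_1}\, d(e) = g$ and $I_1 \subseteq I_2 \Rightarrow \beta(I_2) = \beta(I_1)
\wedge \beta(I_2 - I_1)$, so $\beta(I_2) \geq \beta(I_1)$.

Proof b) We have $g_1 \geq g_2$ and $g_2 \geq g_3 \Rightarrow g_1 \geq g_3$
because $\geq$ is an order relation. With $\alpha(g_2) = \{e \in \mathcal{E} \mid g_2 \geq d(e)\}$,
where $\mathcal{E}$ is the set of examples, and $g_1 \geq g_2$, we get $g_1 \geq d(e)$ for $e \in \alpha(g_2)$, so $\alpha(g_1)
\supseteq \alpha(g_2)$.

Proof c) We have $g_1 \geq g \Rightarrow g_1 \wedge g_2 \geq g$.

$\beta(H) = \wedge_{e \in H}\, d(e)$, $\alpha(\beta(H)) = \{e \in \mathcal{E} \mid \beta(H) \geq d(e)\}$

But $\forall e \in H$, we have $\beta(H) \geq d(e)$ because $\beta(\{e\} \cup K) \geq d(e)$,
by the property of $\wedge$.

Proof d) We have $g \geq g_1$ and $g \geq g_2 \Rightarrow g \geq g_1 \wedge g_2$, so

$\forall g \in D$, $h'(g) = \beta(\alpha(g))$, $\alpha(g) = \{e \in \mathcal{E} \mid g \geq d(e)\}$,

$g \geq \wedge_{e \in \alpha(g)}\, d(e)$ because $g \geq d(e)$ for $e \in \alpha(g)$.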
Abstract

Cross-language latent semantic indexing is
a method that learns useful language-independent
vector representations of terms
through a statistical analysis of a document-aligned
text. This is accomplished by taking
a collection of, say, English paragraphs
and their translations in Spanish and processing
them by singular value decomposition
to yield a high-dimensional vector representation
for each term in the collection. These
term vectors have the property that semantically
similar terms have vectors with high
cosine measure, regardless of their source
language. In the present work, we extend
this approach to the case in which English-Spanish
translations are not available, but
instead, translations for documents in both
languages are available in a third "bridge"
language, say, French. Thus, although no
aligned English-Spanish documents are used,
our method creates a representation in which
English and Spanish terms can be compared.
The resulting vector representation of terms
can be useful in natural language applications
such as cross-language information retrieval
and machine translation.


1  INTRODUCTION 

Vector representations of word "meaning" are useful
for creating computer systems that manipulate textual
information. Such representations are routinely
used in information retrieval (Salton and McGill, 1983;
Deerwester et al., 1990) and filtering, and may find
application in word sense disambiguation, natural language
generation, text comprehension and summarization,
and speech recognition. This vector lexicon
representation, in which words are "defined" by numerical
vectors, has even served as a model for the acquisition
of human knowledge (Landauer and Dumais, 1997).

A natural extension of the vector-lexicon representation
for terms in a single language is a language-independent
representation. This is a promising technique
used in cross-language text retrieval (Landauer
and Littman, 1990; Oard, 1997; Carbonell et al.,
1997), in which natural-language queries in one language
are matched against documents in another language.
It may also be important to automating the
creation of machine-translation systems. The key
property of a good multi-lingual vector lexicon is that
terms with similar meanings, regardless of their languages,
are assigned vectors with high cosine measure.

A major appeal of the vector-lexicon representation is
that it can be learned automatically from text. In latent
semantic indexing (LSI) (Deerwester et al., 1990),
this is accomplished by taking a collection of text that
contains $m$ documents (paragraphs, articles, abstracts,
etc.) and $n_E$ distinct English terms, and forming
an $n_E \times m$ term-document matrix $\mathcal{E}$. The entry
$\mathcal{E}_{ij}$ records a value related to the number of times
term $i$ appears in document $j$; in our work, we use “log
entropy” weighting: $\mathcal{E}_{ij} = \ln(1.0 + \mathrm{tf}_{ij}) \times g_i$, where

$$g_i = 1 - \frac{\mathrm{gf}_i \log_2(\mathrm{gf}_i) - \sum_j \mathrm{tf}_{ij} \log_2(\mathrm{tf}_{ij})}{\mathrm{gf}_i \log_2(m)}, \tag{1}$$

$\mathrm{tf}_{ij}$ is the number of times term $i$ appears in document
$j$, and $\mathrm{gf}_i = \sum_j \mathrm{tf}_{ij}$. This weighting scheme gives
higher weights to distinctive terms.
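
As a small numerical illustration of Equation 1, the following NumPy sketch applies log-entropy weighting to a toy count matrix. The function name `log_entropy_weight` and the toy counts are assumptions made for this example only; the sketch assumes every term occurs at least once and that there is more than one document.

```python
import numpy as np

def log_entropy_weight(tf):
    """Apply log-entropy weighting to a term-by-document count matrix.

    tf[i, j] is the number of times term i appears in document j.
    Assumes every term occurs at least once (gf_i > 0) and m > 1.
    """
    tf = np.asarray(tf, dtype=float)
    m = tf.shape[1]                       # number of documents
    gf = tf.sum(axis=1)                   # global frequency gf_i = sum_j tf_ij
    # tf * log2(tf), using the convention 0 * log2(0) = 0
    safe_tf = np.where(tf > 0, tf, 1.0)
    tf_log_tf = tf * np.log2(safe_tf)
    g = 1.0 - (gf * np.log2(gf) - tf_log_tf.sum(axis=1)) / (gf * np.log2(m))
    return np.log1p(tf) * g[:, None], g   # weighted matrix and global weights

counts = np.array([[1, 1, 1, 1],    # term spread evenly over all documents
                   [4, 0, 0, 0],    # term confined to a single document
                   [2, 2, 0, 0]])   # term split across two of four documents
weighted, g = log_entropy_weight(counts)
print(g)
```

In this toy matrix the evenly spread term receives global weight 0, the term confined to one document receives weight 1, and the term split across two documents receives 0.5, which is the sense in which distinctive terms are weighted more heavily.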

For any given dimensionality $k$,$^1$ we can find a $k$-dimension
vector lexicon by performing a singular
value decomposition (SVD) and reestimating
$\mathcal{E} \approx E \Sigma V^T$, where $E$ is an $n_E \times k$ matrix, $V$ is an
$m \times k$ matrix, both $E$ and $V$ are orthonormal ($E^T E = I$,
$V^T V = I$), and $\Sigma$ is a $k \times k$ diagonal matrix of the
largest singular values of $\mathcal{E}$. In many experiments, it
has been demonstrated that the matrix $E$ of the left
singular vectors of $\mathcal{E}$ is a useful vector lexicon of the
terms. This can be justified as follows. A basic result
from linear algebra (Golub and Van Loan, 1989) is
that, of all rank $k$ matrices, $E \Sigma V^T$ gives the best
approximation of $\mathcal{E}$. Similarly, document-document
correlations $\mathcal{E}^T \mathcal{E}$ are approximated by
$(E \Sigma V^T)^T (E \Sigma V^T) = (\mathcal{E}^T E)(\mathcal{E}^T E)^T$. As $\mathcal{E}^T E$ is
precisely the result of representing corpus $\mathcal{E}$ via vector
lexicon $E$, this choice of vector lexicon best captures the
document-document correlations in the training corpus.

$^1$Reasonable choices for $k$ range from 50 to 1000, de-
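
The rank-$k$ decomposition and the document-correlation identity can be checked numerically. The following is a minimal sketch on a random stand-in matrix (not the UN corpus); the sizes and variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_terms, m_docs, k = 30, 12, 5
X = rng.random((n_terms, m_docs))       # stand-in for a weighted term-document matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
E = U[:, :k]                            # vector lexicon: one k-dimensional row per term
Sigma = np.diag(s[:k])                  # k largest singular values
V = Vt[:k, :].T                         # one k-dimensional row per document

X_k = E @ Sigma @ V.T                   # rank-k approximation of X
docs = X.T @ E                          # documents represented via the lexicon E

# Document-document correlations of the rank-k approximation coincide with
# those of the documents represented through E.
print(np.allclose(X_k.T @ X_k, docs @ docs.T))
```

The check relies only on the orthonormality of the columns of `E` and `V`; it does not depend on the particular weighting applied to the matrix.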


LSI has been extended to produce vector lexicons
for groups of languages simultaneously by analyzing
aligned document collections. Here, in addition to an
$n_E \times m$ term-document matrix $\mathcal{E}$ of English, we also
have an $n_S \times m$ term-document matrix $\mathcal{S}$ of Spanish.
These matrices represent an aligned corpus in that the
$j$th documents in the collections are on the same topic,
or even translations of each other. Cross-language
LSI (CL-LSI) (Landauer and Littman, 1990; Littman
et al., 1998) works by computing a $k$-dimension singular
value decomposition of the $(n_E + n_S) \times m$ matrix

$$\begin{bmatrix} \mathcal{E} \\ \mathcal{S} \end{bmatrix} \approx \begin{bmatrix} E \\ S \end{bmatrix} \Sigma V^T.$$

Here, the matrix $\begin{bmatrix} E \\ S \end{bmatrix}$ is orthonormal ($E$ and $S$ are
not), and $E$ and $S$ can serve as vector lexicons for the
English and Spanish terms, respectively.
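
A minimal sketch of the stacking step, assuming random stand-in matrices: the aligned term-document matrices are stacked row-wise, a single SVD is computed, and the left singular vectors are split back into per-language blocks. The helper name `cl_lsi` is hypothetical.

```python
import numpy as np

def cl_lsi(matrices, k):
    """Stack aligned term-document matrices and split the top-k left singular vectors.

    Every matrix must have the same number of columns (aligned documents).
    Returns the per-language lexicons and the stacked lexicon.
    """
    stacked = np.vstack(matrices)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U_k = U[:, :k]
    # Row offsets at which one language ends and the next begins
    splits = np.cumsum([mat.shape[0] for mat in matrices])[:-1]
    return np.split(U_k, splits, axis=0), U_k

rng = np.random.default_rng(1)
eng = rng.random((8, 6))      # 8 English terms, 6 aligned documents
spa = rng.random((10, 6))     # 10 Spanish terms, the same 6 documents
(E, S), stacked_lexicon = cl_lsi([eng, spa], k=4)

print(np.allclose(stacked_lexicon.T @ stacked_lexicon, np.eye(4)))  # stacked block is orthonormal
print(np.allclose(E.T @ E, np.eye(4)))                              # the English block alone is not
```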


CL-LSI has been successfully applied to many different
pairs of languages, such as English-French (Littman
et al., 1998), English-Japanese (Landauer et al., 1992),
English-Spanish (Carbonell et al., 1997), and English-Greek
(Berry and Young, 1995). It has also been
applied to language triples such as English-French-Spanish
and English-French-German (Rehder et al.,
1997) when three-way document-aligned corpora are
available.


Ultimately, our goal is to solve the corpus hypergraph
problem, illustrated in Figure 1. In a corpus hypergraph,
nodes are different languages and hyperedges
represent aligned corpora. A hyperedge connects the
set of languages appearing in the corresponding corpus.
The corpus hypergraph problem is: given a set
of languages related by corpora given in a hypergraph,
support the comparison of documents expressed in any
two languages connected by a path in the hypergraph.
If this could be accomplished for the corpus hypergraph
in Figure 1, it would become possible to compare
text passages in Russian to text passages in Greek,
even though no corpus was provided that related those
two languages directly.

pending on the application; in general, we must have
$k \le \min(m, n_E)$. We use $k = 500$ in this work.

Figure 1: A corpus hypergraph shows how various languages
are connected by available corpus resources.
The darkened subgraph is the corpus hypergraph used
in our experiments.

As  yet,  no  general  solution  to  the  corpus  hypergraph 
problem  has  been  proposed.  In  this  paper,  we  con¬ 
sider  an  important  and  difficult  special  case  in  which 
all  available  pairwise  document-aligned  corpora  have  a 
single  core  language  in  common.  In  our  experiments, 
this  lingua  franca  is  French,  although  we  expect  En¬ 
glish  to  play  this  role  in  many  applications. 

In  particular,  we  consider  the  problem  of  finding  a 
vector  lexicon  for  English,  Spanish,  and  French  terms 
given  a  document-aligned  English-French  corpus  and  a 
document-aligned  Spanish-French  corpus.  This  would 
be  represented  by  a  corpus  hypergraph  with  three 
nodes  for  the  three  languages,  an  edge  between  En¬ 
glish  and  French,  and  another  edge  between  Spanish 
and  French;  this  is  the  darkened  subgraph  in  Figure  1. 
We  call  such  a  corpus  partially  aligned  because  each 
document  is  available  in  only  French  and  English  or 
French  and  Spanish,  but  not  all  three  languages. 

Section  2  describes  the  way  we  evaluate  our  proposed 
methods.  In  Section  3,  we  show  that  the  application 


316  Littman,  Jiang,  and  Keim 


of  CL-LSI  to  partially  aligned  corpora  does  not  cre¬ 
ate  adequate  vector  lexicons.  In  this  method,  we  at¬ 
tempt  to  capture  the  relationships  between  all  three 
languages  using  a  single  singular  value  decomposition. 
In  Section  4,  we  show  how  a  technique  called  Pro¬ 
crustes  analysis  can  be  applied  to  combine  separate 
SVD  analyses  to  create  a  unified  representation.  This 
technique  shows  great  promise  in  our  initial  evalua¬ 
tions.  We  conclude  with  simple  baseline  comparisons 
and  thoughts  about  the  applicability  of  our  approach. 

2  EXPERIMENTAL  EVALUATION 

We  will  explore  techniques  for  attacking  the  following 
problem.  The  input  is  a  set  of  four  term-document  ma¬ 
trices:  a  document-aligned  pair  of  English  and  French 
matrices,  and  a  document-aligned  pair  of  Spanish  and 
French  matrices.  Each  matrix  contains  500  documents 
drawn  from  a  corpus  of  United  Nations  reports  (de¬ 
scribed  in  more  detail  below).  The  output  is  a  500- 
dimension  vector  lexicon  defining  each  English,  Span¬ 
ish,  and  French  term. 

These vectors are then evaluated by the mate-retrieval
test (Littman et al., 1998). In this test, we take an
independent 2500-document aligned English-Spanish
corpus and represent each document in the corpus by
the sum of the vectors representing the terms it contains
(with term vector $i$ weighted by $g_i$ from Equation 1).
Then, for each English document vector, we
compute its cosine to all 2500 of the Spanish document
vectors and sort the resulting list from largest
to smallest. We note the rank of the Spanish document
that is the English document’s translation or
mate in the document-aligned collection. We repeat
this for all 2500 English test documents and compute
mean and median ranks, as well as the percentage of
English documents with mates at rank 1 and ranks in
the top 10.
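
The mate-retrieval statistics can be sketched as follows. This is an illustrative implementation on synthetic document vectors, not the evaluation code used for the reported results; the function name `mate_retrieval` and the tie-handling rule (ties do not push the mate down) are assumptions.

```python
import numpy as np

def mate_retrieval(eng_docs, spa_docs):
    """Rank each English document's mate among all Spanish documents by cosine.

    Row i of eng_docs and row i of spa_docs are assumed to be mates.
    """
    e = eng_docs / np.linalg.norm(eng_docs, axis=1, keepdims=True)
    s = spa_docs / np.linalg.norm(spa_docs, axis=1, keepdims=True)
    cos = e @ s.T                                   # cos[i, j]: English i vs. Spanish j
    mate_cos = np.diag(cos)[:, None]
    ranks = 1 + (cos > mate_cos).sum(axis=1)        # rank 1 means the mate is closest
    return {
        "mean": ranks.mean(),
        "median": np.median(ranks),
        "top1": 100.0 * np.mean(ranks == 1),
        "top10": 100.0 * np.mean(ranks <= 10),
    }

rng = np.random.default_rng(0)
spa = rng.normal(size=(50, 20))
eng = spa + 0.01 * rng.normal(size=(50, 20))        # near-identical mates
stats = mate_retrieval(eng, spa)
reversed_stats = mate_retrieval(-eng, spa)          # sign-flipped vectors put mates last
print(stats, reversed_stats)
```

With near-identical mates every mate is retrieved at rank 1; flipping the sign of one side sends every mate to the last rank, which is why mean and median ranks can fall well below chance level.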

To  the  extent  that  the  vector  lexicon  generated  is  good, 
similar  documents  will  get  similar  representations  and 
the  mate-retrieval  test  will  reveal  this  by  a  low  median 
rank.  Although  the  mate-retrieval  test  is  a  poor  sub¬ 
stitute  for  traditional  information-retrieval  evaluation 
methods,  it  is  sufficient  for  distinguishing  between  the 
techniques  explored  in  this  paper. 

2.1  EXPERIMENTAL  MATERIALS 

Our  experimental  text  collection  was  drawn  from  the 
United  Nations  Parallel  Text  Corpus  (version  1.0), 
available  through  the  Linguistic  Data  Consortium. 


This  collection  contains  approximately  1.5  gigabytes  of 
text  in  English,  French  and  Spanish.  The  majority  of 
the  documents  were  professionally  translated  from  En¬ 
glish,  although  some  of  the  documents  were  originally 
written  in  Arabic,  French,  Spanish,  Russian,  or  Chi¬ 
nese.  We  started  with  the  1990,  section  00  documents 
in  all  three  languages  (826  files).  We  then  removed 
SGML  tags  and  extracted  paragraphs  with  a  leading 
number  to  help  ensure  that  paragraphs  were  aligned 
between  the  three  languages.  We  selected  paragraphs 
that  had  at  least  10  words  and  that  varied  in  length 
no  more  than  75%  among  the  three  languages,  yielding 
6151  triplets  of  paragraphs,  which  in  English  ranged 
from  10  to  448  words  (mean  68.4,  s.d.  48.5). 

From this collection of paragraphs, we randomly selected
two disjoint sets of 500 three-way aligned paragraphs
to serve as training texts. We formed $6488 \times 500$
English term-document matrices $\mathcal{E}$ and $\mathcal{E}'$, $7812 \times 500$
French term-document matrices $\mathcal{F}$ and $\mathcal{F}'$, and $8313 \times 500$
Spanish term-document matrices $\mathcal{S}$ and $\mathcal{S}'$ ($\mathcal{E}$, $\mathcal{F}$,
and $\mathcal{S}$ are aligned, and $\mathcal{E}'$, $\mathcal{F}'$, and $\mathcal{S}'$ are aligned).
Note that we tagged all terms with their source language,
so our results are not due to the incidental
overlap of, e.g., numerals and proper names across
languages. Table 1 gives an example of the type of
documents we used.

From the same collection, we extracted 2500 aligned
English and Spanish test documents, $\tilde{\mathcal{E}}$ and $\tilde{\mathcal{S}}$, for use
in the mate-retrieval experiments. All matrices are
weighted using Equation 1.

3  CROSS-LANGUAGE  LSI 

As  a  first  test,  we  applied  CL-LSI  to  the  problem  of 
learning  a  vector  lexicon  from  the  partially  aligned 
English-French-Spanish  corpus. 

3.1  FULLY  ALIGNED  CORPUS 

As a baseline, we began with an experiment using the
fully aligned English-French-Spanish documents. We
used CL-LSI to find a 500-dimension vector lexicon for
the terms in all three languages by computing an SVD
of the matrix

$$\begin{bmatrix} \mathcal{E} & \mathcal{E}' \\ \mathcal{F} & \mathcal{F}' \\ \mathcal{S} & \mathcal{S}' \end{bmatrix} \approx \begin{bmatrix} E \\ F \\ S \end{bmatrix} \Sigma V^T,$$

yielding representations in the form of matrices $E$,
$F$, and $S$. Evaluating these representations using the
mate-retrieval test gave the performance listed in Row 9
of Table 2. This performance is quite strong: the average
rank of an English document compared to its
Spanish mate is 2.0, and 99.6% of the time, the mate
is ranked in the top 10. This is consistent with results
obtained with CL-LSI in other text collections.

Table 1: An example of aligned English, Spanish, and
French documents

Despite the difficult situation, there has been a strong political
desire to incorporate child-related activities into the national
political agenda. However, more efforts are needed to coordinate
the activities of the public and private sectors. The
experience gained during this period demonstrated that studies
were an important contribution to the social mobilization
process, and that UNICEF cooperation should further reinforce
the adoption of low-cost measures for child survival and
development (CSD), with greater community participation.

Pese a la difícil situación existente, se ha expresado un notable
interés político por incorporar las actividades relacionadas con
el niño en el programa político del país. Sin embargo, hay que
desplegar mayores esfuerzos para coordinar las actividades de
los sectores público y privado. La experiencia adquirida durante
este período demostró que los estudios constituían una
importante contribución al proceso de movilización social y
que la cooperación del UNICEF debía reforzar aún más la
adopción de medidas de bajo costo destinadas a la supervivencia
y el desarrollo del niño en que se diera mayor participación
a la comunidad.

On a pu constater qu’il existait, malgré les difficultés une
forte volonté d’intégrer des activités en faveur des enfants
dans le programme politique national. Mais l’action du
secteur public et celle du secteur privé ne sont pas encore
suffisamment coordonnées. On a aussi constaté à
l’évidence durant cette période qu’il est très utile de pouvoir
s’appuyer sur des études lorsqu’on cherche à mobiliser les
collectivités et que dans l’action menée pour assurer la survie
et le développement des enfants l’UNICEF devrait encourager
encore davantage à prendre des mesures peu coûteuses
et qui fassent davantage intervenir les communautés.

To help understand mathematically why CL-LSI is
able to handle fully aligned corpora so well, consider
the following idealized experiment. Imagine that
$[\,\mathcal{E}\ \ \mathcal{E}'\,] = [\,\mathcal{F}\ \ \mathcal{F}'\,] = [\,\mathcal{S}\ \ \mathcal{S}'\,]$, as would be
the case if language translation were a simple matter of
word substitution (as one might get in, say, Pig Latin).
In this pure case, the vector lexicon for the terms in
all three languages $E$, $F$, and $S$ can be shown to be
equal to $\sqrt{1/3}$ times the matrix of left singular vectors
of $[\,\mathcal{F}\ \ \mathcal{F}'\,]$. The important thing here is that the
method correctly assigns identical vectors to the words
in the “different” languages.
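
This idealized claim is easy to verify numerically. The sketch below stacks three copies of one random matrix (a stand-in for the shared term-document matrix) and compares the resulting blocks with the single-language singular vectors; because singular vectors are defined only up to sign, the comparison with the single-language lexicon uses products of the form $X X^T$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((7, 5))          # plays the role of the shared matrix [F F']
k = 3

# Single-language LSI lexicon
U_single = np.linalg.svd(A, full_matrices=False)[0][:, :k]

# CL-LSI on three identical "languages" stacked on top of each other
U_stacked = np.linalg.svd(np.vstack([A, A, A]), full_matrices=False)[0][:, :k]
E, F, S = np.split(U_stacked, 3, axis=0)

print(np.allclose(E, F) and np.allclose(F, S))              # identical vectors per word
print(np.allclose(E @ E.T, U_single @ U_single.T / 3.0))    # each block is sqrt(1/3) * U, up to sign
```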

3.2  PARTIALLY  ALIGNED  CORPUS 


In the partially aligned case, the matrices $\mathcal{E}'$ and $\mathcal{S}$
are not available, and we still seek to find a vector
lexicon for English and Spanish so that similar terms
are assigned vectors with high cosines.

We attacked this problem using CL-LSI by computing
an SVD of the matrix

$$\mathcal{D} = \begin{bmatrix} \mathcal{E} & 0 \\ \mathcal{F} & \mathcal{F}' \\ 0 & \mathcal{S}' \end{bmatrix}.$$

Extracting the left singular vectors from this decomposition,
we obtained a 500-dimension vector lexicon for
English, French, and Spanish. The hope was that the
method would be able to identify English-Spanish term
relationships transitively through the common French
terms. The evaluation of the derived vector lexicon
using mate retrieval appears in Row 1 of Table 2.

Although this is perhaps the most elegant approach to
the problem, it has the unfortunate property of giving
dismal performance. In fact, the mean and median
ranks are much worse than the 1250 expected by
pure chance (Row 2). The reason for this poor performance
is obvious in retrospect. The zeros in the
definition of $\mathcal{D}$ are meant to denote missing information
(e.g., that we have no Spanish documents related
to $\mathcal{E}$ and $\mathcal{F}$). However, SVD treats these as real zeros
and detects the fact that English and Spanish terms
never co-occur; it assigns representations that capture
this, making English and Spanish terms appear quite
different.

Table 2: Mate-retrieval results for English to Spanish for several approaches

    METHOD                          MEAN  % IN TOP 1  % IN TOP 10  MEDIAN
 1  partially aligned             2177.8        0.3%         0.7%    2455
 2  random order                  1250.0        0.0%         0.4%    1250
 3  partially aligned (reversed)   323.2        7.9%        28.9%      46
 4  Procrustes on terms (to FF)    235.2       11.6%        37.8%      26
 5  Procrustes on terms            216.5       11.8%        38.6%      22
 6  Word matching                   45.4       20.0%        47.6%      14
 7  Dictionary translation          17.2       79.9%        93.8%       1
 8  Procrustes on documents         14.0       57.5%        87.4%       1
 9  fully aligned                    2.0       92.2%        99.6%       1

A similar analysis of the idealized case from the previous
section is quite instructive here. Imagine that
$\mathcal{E} = \mathcal{F} = \mathcal{F}' = \mathcal{S}'$. Let $F$ be the matrix of left singular
vectors of $\mathcal{F}$; this is the vector lexicon learned by LSI from
an analysis of $\mathcal{F}$. Applying CL-LSI to $\mathcal{D}$ yields a vector
lexicon for English of $E = [\,-\sqrt{1/6}\,F\ \ \ -\sqrt{1/2}\,F\,]$
and for Spanish of $S = [\,-\sqrt{1/6}\,F\ \ \ \sqrt{1/2}\,F\,]$. This
is interesting because the matrix of English-Spanish
term correlations $E S^T = -\frac{1}{3} F F^T$, or sign-reversed
from the French-French term correlations. Essentially,
terms that should have the highest similarity, like
translations, are actually assigned the most dissimilar
vectors by this method. While this simple analysis
does not precisely capture the complexity of a realistic
experiment, it does strongly suggest that CL-LSI is
not well suited to analyzing partially aligned corpora,
even under idealized circumstances.
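
The sign reversal in the idealized partially aligned case can be reproduced with a few lines of NumPy. In this sketch all four available blocks are the same random matrix `A` (an assumption that mirrors the idealization above), and all left singular vectors with nonzero singular values are retained so that the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((6, 4))                     # idealized: all four available blocks are identical
Z = np.zeros_like(A)
D = np.block([[A, Z],                      # English rows: documents of the first set only
              [A, A],                      # French rows: documents of both sets
              [Z, A]])                     # Spanish rows: documents of the second set only

r = np.linalg.matrix_rank(A)
U = np.linalg.svd(D, full_matrices=False)[0][:, :2 * r]
E, F_block, S = np.split(U, 3, axis=0)     # per-language lexicons from the single SVD
F = np.linalg.svd(A, full_matrices=False)[0][:, :r]   # LSI lexicon of A alone

# English-Spanish term correlations are sign-reversed French-French correlations.
print(np.allclose(E @ S.T, -(F @ F.T) / 3.0))
```

Within a language the correlations keep their sign (`E @ E.T` equals `2/3 * F @ F.T` in this setting); only the cross-language English-Spanish block is reversed.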

Row  3  of  Table  2  gives  the  results  of  applying  CL- 
LSI  to  the  partially  aligned  corpus,  then  performing 
the  mate-retrieval  experiment,  computing  document 
similarities  using  negative  cosine.  The  results  here  are 
actually  quite  good  considering  the  vector  lexicon  is 
being  used  backwards! 

In  the  next  section,  we  show  how  to  improve  perfor¬ 
mance,  with  the  added  benefit  that  we  no  longer  need 
to  use  the  negative  cosine  similarity  for  cross-language 
comparisons. 

4  PROCRUSTES  APPROACHES 

The  results  of  the  previous  section  show  that  it  does 
not  make  sense  to  use  a  single  SVD  to  create  a  vector 
lexicon  from  a  partially  aligned  corpus.  However,  it  is 
also  clear  that,  within  fully  aligned  corpora,  we  can  use 
CL-LSI  to  find  vector  lexicons  that  produce  acceptable 
performance. 

Since a partially aligned corpus contains smaller, fully
aligned corpora within it, this suggests a different
strategy. Specifically, we can take the fully aligned
corpus formed by $\mathcal{E}$ and $\mathcal{F}$ and use it to build vector
lexicons for English and French, $E$ and $F$, derived by
CL-LSI from a singular value decomposition of
$\begin{bmatrix} \mathcal{E} \\ \mathcal{F} \end{bmatrix}$.

Once this is done, each English and French term can
be thought of as a point in a high-dimensional vector
space: the rows of the $E$ and $F$ matrices are the
coordinates of the points.

The English-French vector space can also be thought
to contain points for the documents in various document
collections, since a document can be represented
as the weighted sum of the representations of the terms
it contains. The French documents are the points
corresponding to the rows of $\hat{F} = \mathcal{F}^T F$ and the English
test documents are the points corresponding to the
rows of $\hat{E} = \tilde{\mathcal{E}}^T E$. Within the English-French vector
space, English terms, French terms, English documents,
and French documents can be compared using
the cosine metric.

Figure 2: After performing separate CL-LSI analyses,
English terms $E$, French terms $F$, English test
documents $\hat{E}$, and French training documents $\hat{F}$
occupy the English-French vector space while Spanish
terms $S'$, French terms $F'$, Spanish test documents $\hat{S}$,
and French training documents $\hat{F}'$ occupy the
Spanish-French vector space.

Separately, we can also construct a Spanish-French
vector space using CL-LSI applied to the fully aligned
$\mathcal{S}'$ and $\mathcal{F}'$ collections. This situation is depicted in
Figure 2. Let $F'$ and $S'$ be the vector lexicons derived
by CL-LSI; the rows of these matrices are the points
corresponding to the terms in the Spanish-French vector
space. The rows of $\hat{F}' = \mathcal{F}'^T F'$ and $\hat{S} = \tilde{\mathcal{S}}^T S'$
are the points corresponding to the French training
documents and Spanish test documents, respectively.
Figure 3: English term and document vectors can be
transformed from the English-French vector space into
the Spanish-French vector space using a Procrustes
analysis derived from shared French terms.

4.1  PROCRUSTES  ON  TERMS 

Although  the  two  vector  spaces  in  Figure  2  are  sep¬ 
arate,  they  do  have  something  in  common.  Specifi¬ 
cally,  both  have  points  corresponding  to  French  terms. 
Thus,  the  rows  of  F  and  F'  can  be  seen  as  a  set  of 
“bridge”  vectors  that  can  be  used  to  find  a  way  of 
transforming  points  in  the  English-French  vector  space 
into  the  Spanish-French  vector  space. 

Mathematically, we need to derive a transformation
matrix $T_{F \to F'}$ that rotates the English-French vector
space so that the rows of $F$ become as similar
as possible to the rows of $F'$. Thus, $T_{F \to F'}$
should be a rotation matrix ($T_{F \to F'}^T\, T_{F \to F'} = I =
T_{F \to F'}\, T_{F \to F'}^T$) that minimizes the Frobenius norm
$\|F' - F\, T_{F \to F'}\|$; an algorithm for computing such a
transformation (Golub and Van Loan, 1989) is given
in Appendix A and is known as Procrustes analysis.$^2$
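
One standard way to compute such a transformation is the SVD-based solution of the orthogonal Procrustes problem. The sketch below uses that textbook construction on synthetic bridge vectors; it is not a transcription of Appendix A, and the function name `procrustes_rotation` is hypothetical. The result is orthogonal, so it may include a reflection as well as a rotation.

```python
import numpy as np

def procrustes_rotation(A, B):
    """Orthogonal T minimizing the Frobenius norm ||B - A @ T||.

    Rows of A and B are paired points (e.g., the same bridge terms in two spaces).
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

rng = np.random.default_rng(4)
F = rng.normal(size=(40, 5))                         # bridge vectors in the first space
Q_true = np.linalg.qr(rng.normal(size=(5, 5)))[0]    # a hidden orthogonal map
F_prime = F @ Q_true                                 # the same points seen in the second space

T = procrustes_rotation(F, F_prime)
print(np.allclose(T, Q_true))                        # the hidden map is recovered
```

In this noiseless setting the hidden map is recovered exactly; with real bridge vectors the two spaces differ by more than a rotation, and the residual $\|F' - F\,T\|$ stays positive.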

Given the transformation matrix $T_{F \to F'}$, we create a
new vector lexicon $\bar{E} = E\, T_{F \to F'}$. The vector lexicon
$\bar{E}$ has the property that English-English term similarities
are unchanged from the CL-LSI-derived vector
lexicon $E$, since

$$\bar{E}\bar{E}^T = E\, T_{F \to F'} (E\, T_{F \to F'})^T = E\, T_{F \to F'}\, T_{F \to F'}^T E^T = E E^T;$$

however, these English term vectors now occupy the
Spanish-French vector space, so comparisons between
English and Spanish terms can be made. Figure 3
illustrates this procedure.

To evaluate this idea, we carried out the following experiment.
First, we listed all the French terms that
appear in both $\mathcal{F}$ and $\mathcal{F}'$ (for the purposes of this
experiment, we excluded terms with digits in them).
This results in a set of 3311 French terms, from which
we picked 500 terms uniformly at random without replacement
to form a set of bridge vectors.$^3$ We used
these terms to find a transformation from English-French
vector space to Spanish-French vector space,
applied this transformation to the English test documents
($\hat{E}\, T_{F \to F'}$), and performed the mate-retrieval test
against the Spanish test documents $\hat{S}$. The result of
this test is given in Row 5 of Table 2.

$^2$Procrustes is the robber of Greek legend who forced
people of varying sizes to fit perfectly in a fixed sized bed
by stretching or cutting them as appropriate.

The median rank of mates for this method is 22, making
it better than the performance of “reversed” CL-LSI
applied to this partially aligned corpus (median
rank of mate of 46), but still far from the performance
of CL-LSI on the fully aligned corpus (median rank of
mate of 1). Note that transforming the Spanish terms
into the English-French vector space via an appropriately
defined $T_{F' \to F}$ gives precisely the same mate-retrieval
performance because of the symmetric nature
of the transformation matrix.

To understand why Procrustes on terms did not perform
quite up to expectations, we tested a fundamental
hypothesis behind this approach. In particular,
we were assuming that the French vector lexicon derived
from the English-French corpus was essentially
the same as that derived from the Spanish-French corpus.
More precisely, we were assuming that the French
term-term correlations in the two vector spaces were
roughly the same. To test whether this was true, we
measured the stability of the term-term similarities in
the two vector spaces by the correlation between $F F^T$
and $F' F'^T$; it is 0.52, suggesting a positive correlation,
but perhaps not one strong enough to form an ideal
bridge between English and Spanish.
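
One plausible way to compute such a stability figure is the Pearson correlation between corresponding entries of the two similarity matrices. The text does not spell out the exact procedure, so the sketch below is an assumption: it compares off-diagonal entries of the upper triangles, and the function name `gram_correlation` is hypothetical.

```python
import numpy as np

def gram_correlation(X, Y):
    """Pearson correlation between the entries of X X^T and Y Y^T.

    Rows of X and Y describe the same items in two vector spaces. Only the
    off-diagonal upper triangle is used (an assumption of this sketch).
    """
    gx, gy = X @ X.T, Y @ Y.T
    iu = np.triu_indices(gx.shape[0], k=1)
    return np.corrcoef(gx[iu], gy[iu])[0, 1]

rng = np.random.default_rng(6)
X = rng.normal(size=(30, 5))
Q = np.linalg.qr(rng.normal(size=(5, 5)))[0]
same = gram_correlation(X, X @ Q)                          # a pure rotation preserves similarities
unrelated = gram_correlation(X, rng.normal(size=(30, 5)))  # independent vectors do not
print(same, unrelated)
```

A value of 1 indicates that the two spaces differ only by an orthogonal transformation, which is exactly the situation in which a Procrustes bridge is lossless.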

4.2  PROCRUSTES  ON  DOCUMENTS 

We felt we could establish a stronger bridge using a
more stable French vector lexicon. We considered the
corpus formed by taking the union of the documents
in $\mathcal{F}$ and $\mathcal{F}'$. Let $F_{FF}$ be the vector lexicon derived
by LSI on the matrix $[\,\mathcal{F}\ \ \mathcal{F}'\,]$. This vector lexicon
defines a French-French vector space that contains
French terms $F_{FF}$ as well as both sets of French training
documents $\hat{F}_{FF} = \mathcal{F}^T F_{FF}$ and $\hat{F}'_{FF} = \mathcal{F}'^T F_{FF}$.

As we had hoped, this procedure strengthened somewhat
the stability of the French vector lexicons. In
particular, for the set of French bridge terms, the
correlation between $F F^T$ and $F_{FF} F_{FF}^T$ is 0.74 and
the correlation between $F' F'^T$ and $F_{FF} F_{FF}^T$ is 0.72.
Hence, we would expect the bridge between the English-French
vector space and the French-French vector
space to be stronger than the one between the English-French
vector space and the Spanish-French vector
space from the previous experiment, where the correlation
was 0.52.

$^3$Of course, there are many other ways one might select
a set of terms to create bridge vectors. Our choice was
motivated by a combination of simplicity and practicality.

To evaluate the revised method, we constructed
$T_{F \to F_{FF}}$ and $T_{F' \to F_{FF}}$ by Procrustes analysis and
transformed the English and Spanish test documents
from their respective vector spaces into the French-French
vector space.

Performing a mate-retrieval test on these collections
gave the results appearing in Row 4 of Table 2, which
are just slightly worse than those of the previous experiment
(e.g., median rank of mate of 26 instead of 22),
in spite of the apparently higher correlations for the
bridge vectors. We explain this by noting that in the
previous experiment, English terms were transformed
into the Spanish-French vector space via a transformation
with a correlation of 0.52. Here, before English
and Spanish terms or documents can be compared, two
transformations take place. Under an independence
assumption, we estimate the cumulative correlation of
the two transformations as $0.74 \times 0.72 \approx 0.53$, which is
about the same as in the previous experiment.

We  then  repeated  the  experiment  with  the  difference 
that  we  based  our  transformations  on  document  vec¬ 
tors  instead  of  term  vectors.  Our  intuition  is  that  doc¬ 
uments  are  more  stable  predictors  of  meaning  than  are 
individual  terms. 

Figure 4 illustrates our technique of merging vector
spaces by a Procrustes analysis of shared documents.
The basic idea is that we first create separate
English-French, Spanish-French, and French-French
vector spaces from the various combinations of pieces
of the partially aligned corpus. We then create vector
representations for the French training documents
$\mathcal{F}$ in both the English-French ($\hat{F}$) and French-French
($\hat{F}_{FF}$) spaces. This allows us to create a transformation
matrix $T_{\hat{F} \to \hat{F}_{FF}}$ from English-French to French-French
via Procrustes analysis. English test documents $\tilde{\mathcal{E}}$ can
now be located in the French-French space via

$$\hat{E}\, T_{\hat{F} \to \hat{F}_{FF}}.$$

A similar analysis results in transforming the Spanish
test documents into the French-French vector space.
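
The whole pipeline can be sketched end to end on synthetic data. The code below is an illustrative reading of the procedure, not the original implementation: the helper names are hypothetical, documents are plain (unweighted) sums of term vectors, and the demonstration uses idealized "word substitution" languages with no rank truncation, in which case the mapped English and Spanish test documents coincide exactly.

```python
import numpy as np

def left_vectors(X, k):
    """First k left singular vectors of X."""
    return np.linalg.svd(X, full_matrices=False)[0][:, :k]

def procrustes_rotation(A, B):
    """Orthogonal T minimizing the Frobenius norm ||B - A @ T||."""
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt

def procrustes_on_documents(eng, fre, fre2, spa2, eng_test, spa_test, k):
    """Map English and Spanish test documents into a shared French-French space.

    eng/fre and spa2/fre2 are aligned training matrices (terms x documents).
    """
    n_e, n_s = eng.shape[0], spa2.shape[0]
    ef = left_vectors(np.vstack([eng, fre]), k)       # English-French space
    E, F = ef[:n_e], ef[n_e:]
    sf = left_vectors(np.vstack([spa2, fre2]), k)     # Spanish-French space
    S2, F2 = sf[:n_s], sf[n_s:]
    F_ff = left_vectors(np.hstack([fre, fre2]), k)    # French-French space

    # Bridges: the same French training documents seen from two spaces.
    T_e = procrustes_rotation(fre.T @ F, fre.T @ F_ff)
    T_s = procrustes_rotation(fre2.T @ F2, fre2.T @ F_ff)
    return (eng_test.T @ E) @ T_e, (spa_test.T @ S2) @ T_s

rng = np.random.default_rng(5)
n, m = 6, 15
A, B, test = rng.random((n, m)), rng.random((n, m)), rng.random((n, 4))
# Idealized case: every language shares the same term-document matrices.
eng_vecs, spa_vecs = procrustes_on_documents(A, A, B, B, test, test, k=n)
print(np.allclose(eng_vecs, spa_vecs))
```

With realistic corpora and $k$ smaller than the vocabulary size the two sets of vectors no longer coincide, and the quality of the match is what the mate-retrieval test measures.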

The resulting vectors for the test documents can then
be evaluated using the mate-retrieval test. The performance,
given in Row 8 of Table 2, is quite encouraging.
We find that more than half the time, English test documents
are closest to their Spanish mates, and 87.4%
of the time, the mate is ranked in the top 10 out of
2500 Spanish test documents. This is not quite as good
as the results of training directly on aligned English-Spanish
documents, but, considering the small amount
of training data, the performance is quite respectable.

Figure 4: English term and document vectors can be
transformed from the English-French vector space into
the French-French vector space using a rotation derived
from shared French documents. Spanish term
and document vectors can be transformed analogously.

It is instructive to look once more at the correlations
of the bridge vectors used to map between the various
vector spaces. The correlation between the French
$\mathcal{F}$ document-document correlations in the English-French
vector space $\hat{F}\hat{F}^T$ and the French-French vector
space $\hat{F}_{FF}\hat{F}_{FF}^T$ is 0.98. Similarly, the correlation
between the French $\mathcal{F}'$ document-document correlations
in the Spanish-French vector space $\hat{F}'\hat{F}'^T$ and
the French-French vector space $\hat{F}'_{FF}\hat{F}'^T_{FF}$ is 0.98.
The product of these correlations is 0.97, providing
strong evidence that the transformations that bring
the English and Spanish test documents together in
the French-French space are robust and accurate.
5  BASELINE  COMPARISONS

To  help  evaluate  our  results,  we  carried  out  two 
simple  comparative  experiments.  First,  we  repeated 
the  mate-retrieval  evaluation  using  English  to  retrieve 
Spanish  using  only  direct  word  matching.  That  is,  only 
words  appearing  in  both  languages  (proper  names, 
numbers,  acronyms,  and  some  cognates  like  “idea”) 
were  used  to  measure  the  similarity  between  docu¬ 
ments.  This  type  of  information  is  routinely  used  to 
align  texts  and  it  is  reasonable  to  ask  how  this  method 
compares  to  Procrustes  by  documents. 

The result of mate-retrieval via word matching appears
in  Row  6  of  Table  2.  This  method  scores  between 
the  term-based  Procrustes  methods  and  Procrustes  by 
documents.  Thus,  Procrustes  by  documents  is  extract¬ 
ing  useful  information  beyond  simply  pairing  up  cross¬ 
language  homographs. 

A second comparison used an available 21,000-word
English-Spanish  bilingual  dictionary  to  compare  En¬ 
glish  and  Spanish  documents.  Although  this  approach 
does  not  construct  a  vector  lexicon,  it  does  help  estab¬ 
lish  the  baseline  difficulty  of  the  mate-retrieval  task 
when  additional  linguistic  resources  are  available. 

We created a simple word-for-word translation system
from English to Spanish. For each English word found
in the dictionary, we substituted all Spanish entries.
English words not found were left untranslated (allowing
proper names and acronyms to pass through).
We performed no stemming or morphological analysis
in either language. The translation procedure substituted
about 23.6% of types and 69% of tokens in the
original English.
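
The substitution procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the dictionary layout (a mapping from an English word to a list of Spanish entries), and the toy data are hypothetical and are not taken from the original system.

```python
from typing import Dict, List, Tuple


def translate_word_for_word(tokens: List[str],
                            dictionary: Dict[str, List[str]]) -> List[str]:
    """Replace each English token by all of its Spanish entries.

    Tokens missing from the dictionary pass through unchanged, so proper
    names and acronyms survive. No stemming is applied.
    """
    out: List[str] = []
    for tok in tokens:
        entries = dictionary.get(tok)
        if entries:
            out.extend(entries)   # substitute every listed translation
        else:
            out.append(tok)       # leave unknown words untranslated
    return out


def substitution_rates(tokens: List[str],
                       dictionary: Dict[str, List[str]]) -> Tuple[float, float]:
    """Fraction of distinct types and of running tokens found in the dictionary."""
    types = set(tokens)
    type_rate = sum(t in dictionary for t in types) / len(types)
    token_rate = sum(t in dictionary for t in tokens) / len(tokens)
    return type_rate, token_rate


# Hypothetical toy data, for illustration only.
toy_dictionary = {"the": ["el", "la"], "house": ["casa"], "idea": ["idea"]}
doc = ["the", "house", "of", "nato", "the", "idea"]
translated = translate_word_for_word(doc, toy_dictionary)
type_rate, token_rate = substitution_rates(doc, toy_dictionary)
```

Because frequent words are more likely to have dictionary entries, the token rate is typically higher than the type rate, which is the pattern reported above.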

The translated English documents were then matched
against the Spanish test documents and mate-retrieval
statistics collected; they are given in Row 7 of Table 2.
In terms of mean rank of mate, Procrustes on documents,
which used no direct English-Spanish information,
performed better than the human-constructed
dictionary (14.0 vs. 17.2). However, dictionary translation
scored better in percentage in top 1 and top 10.
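
As a rough sketch of how the three statistics mentioned here (mean rank of mate, percentage in top 1, percentage in top 10) could be computed from a similarity matrix: the function name and the tie-handling rule (ties are resolved in favor of the mate) are assumptions, since this section does not specify them.

```python
import numpy as np


def mate_retrieval_stats(sim):
    """sim[i, j] is the similarity of query document i to candidate j.

    The mate (translation) of query i is assumed to be candidate i.
    Rank 1 means the mate scored at least as high as every other candidate.
    """
    sim = np.asarray(sim, dtype=float)
    mate_scores = np.diag(sim)[:, None]
    # Rank = 1 + number of candidates that score strictly higher than the mate.
    ranks = 1 + (sim > mate_scores).sum(axis=1)
    return {
        "mean_rank": float(ranks.mean()),
        "pct_top1": 100.0 * float((ranks == 1).mean()),
        "pct_top10": 100.0 * float((ranks <= 10).mean()),
    }


# Toy 3x3 example: the mates of queries 0 and 2 rank first, that of query 1 second.
example = mate_retrieval_stats([[0.9, 0.1, 0.2],
                                [0.8, 0.5, 0.1],
                                [0.1, 0.2, 0.3]])
```

Mean rank and top-k percentages can disagree, as they do in the comparison above, because a few badly ranked mates inflate the mean without affecting the top-1 count.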

The results in Table 2 suggest that, when a parallel
corpus is available (Row 9), it is the most effective
method for constructing a vector lexicon. In the absence
of a parallel corpus, a bilingual dictionary can
be used to match documents with one another and
this works relatively well (Row 7), at least for “query
by example” type of tasks. Procrustes on documents
(Row 8) performs a bit less well by some measures,
but does not require any direct resources relating the
languages in question.

That a parallel corpus technique outperforms dictionary
translation is consistent with earlier studies
measuring average precision for information-retrieval
tasks (Carbonell et al., 1997). We are in the process
of evaluating Procrustes by documents on this task.


6  CONCLUSIONS 


In this paper, we introduce the problem of learning
multi-lingual vector lexicons for partially aligned corpora.
We couch the general problem in terms of taking
a corpus hypergraph and finding vector representations
for all the terms so that semantically similar
terms have representations with high cosine, independent
of language.

Although we do not propose a solution to the general
corpus hypergraph problem, we attack the specific
case in which two languages are related only indirectly
through a third language. Specifically, we have an
English-French aligned corpus and a Spanish-French
aligned corpus and we want to make comparisons transitively
between English and Spanish terms. We evaluate
several algorithms for deriving vector lexicons from
this type of corpus.

Of the approaches we considered, the most successful
was Procrustes on documents, in which English and
Spanish terms are transformed into a French-only vector
space on the basis of the representation of a core set
of French documents. A vector lexicon derived by this
method was able to perfectly identify Spanish translations
of English documents from a field of 2500 choices
57.5% of the time. While this is not nearly as accurate
as a vector lexicon trained directly on aligned
English and Spanish documents (92.2%), it is encouraging
given that all English-Spanish term-term relationships
were derived indirectly and completely automatically
through connections with French terms.

Our technique is applicable for finding relationships
among arbitrarily large sets of languages as long as all
these languages are related through aligned corpora
with some core language (be it French, English, Russian,
etc.). In future work, we hope to extend this
approach to use information in a web of aligned corpora
to establish a robust representation for terms in
any of the world’s languages.
A  PROCRUSTES  DERIVATION

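We seek a transformation $T_{A \to B}$ that rotates the
vectors in the rows of $A$ and makes them as similar as
possible to the vectors in the rows of $B$. Specifically,
we want $T_{A \to B}$ to be the $k \times k$ orthonormal matrix $Q$
such that the Frobenius norm $\|B - AQ\|$ is minimized over
all matrices $Q$ with $Q^TQ = I = QQ^T$.

First, we use the definition that $\mathrm{trace}(X)$ is the sum of
the diagonal entries of $X$ and that $\|X\|^2$ is the sum of the
squares of all the entries of $X$, so that
$\|X\|^2 = \mathrm{trace}(XX^T)$. Minimizing $\|B - AQ\|$ is
equivalent to minimizing its square:

$$
\begin{aligned}
\min_{Q^TQ=I} \|B - AQ\|^2
 &= \min_{Q^TQ=I} \mathrm{trace}\big((B - AQ)(B - AQ)^T\big) \\
 &= \min_{Q^TQ=I} \big(\mathrm{trace}(BB^T) - \mathrm{trace}(BQ^TA^T)
    - \mathrm{trace}(AQB^T) + \mathrm{trace}(AA^T)\big).
\end{aligned}
$$

The first and last terms do not depend on $Q$ (note that
$AQQ^TA^T = AA^T$), and the two middle terms are equal because
$\mathrm{trace}(X^T) = \mathrm{trace}(X)$. The minimization is
therefore equivalent to

$$
\max_{Q^TQ=I} \mathrm{trace}(AQB^T).
$$

Now, note that $\mathrm{trace}(XY) = \mathrm{trace}(YX)$ and let
$A^TB = U \Sigma V^T$ be the singular value decomposition
of $A^TB$. This gives us

$$
\begin{aligned}
\max_{Q^TQ=I} \mathrm{trace}(AQB^T)
 &= \max_{Q^TQ=I} \mathrm{trace}(B^TAQ) \\
 &= \max_{Q^TQ=I} \mathrm{trace}(V \Sigma U^T Q) \\
 &= \max_{Q^TQ=I} \mathrm{trace}(\Sigma U^T Q V).
\end{aligned}
$$

Since $V$, $Q$, and $U$ are all orthonormal, they cannot
increase the length of any of the rows of $\Sigma$. This means
that the maximum of the sum of the diagonal entries
of $\Sigma U^T Q V$ occurs when $U^TQV = I$, or when
$Q = UV^T$. Thus, $T_{A \to B} = UV^T$ is the transformation
that maximizes the alignment between $A$ and $B$.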
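
The closed-form result can be checked numerically. The following is a minimal sketch using NumPy; the function name and the synthetic test data are illustrative assumptions, not part of the original experiments.

```python
import numpy as np


def procrustes_rotation(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Return the orthonormal Q minimizing ||B - A Q|| (Frobenius norm).

    Following the derivation, take the SVD of A^T B = U S V^T and set Q = U V^T.
    """
    U, _, Vt = np.linalg.svd(A.T @ B)
    return U @ Vt


# Synthetic check: build B by rotating A with a known orthonormal matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 4))
Q_true, _ = np.linalg.qr(rng.standard_normal((4, 4)))
B = A @ Q_true

Q = procrustes_rotation(A, B)
residual = np.linalg.norm(B - A @ Q)
```

When `B` is an exact rotation of `A` and `A` has full column rank, the recovered `Q` coincides with the rotation used to generate `B` and the residual is zero up to rounding error.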
Abstract

Recent research on hidden-state reinforcement
learning (RL) problems has concentrated
on overcoming partial observability by
using memory to estimate state. However,
such methods are computationally extremely
expensive and thus have very limited applicability.
This emphasis on state estimation
has come about because it has been widely
observed that the presence of hidden state
or partial observability renders popular RL
methods such as Q-learning and Sarsa useless.
However, this observation is misleading
in two ways: first, the theoretical results supporting
it only apply to RL algorithms that
do not use eligibility traces, and second, these
results are worst-case results, which leaves
open the possibility that there may be large
classes of hidden-state problems in which RL
algorithms work well without any state estimation.

In this paper we show empirically that
Sarsa(λ), a well known family of RL algorithms
that use eligibility traces, can work
very well on hidden state problems that have
good memoryless policies, i.e., on RL problems
in which there may well be very poor
observability but there also exists a mapping
from immediate observations to actions that
yields near-optimal return. We apply conventional
Sarsa(λ) to four test problems taken
from the recent work of Littman; Littman,
Cassandra, and Kaelbling; Parr and Russell;
and Chrisman, and in each case we show that
it is able to find the best, or a very good,
memoryless policy without any of the computational
expense of state estimation.
1  Introduction

Sequential decision problems in which an agent’s sensory
observations provide it with the complete state
of its environment can be formulated as Markov decision
processes, or MDPs, for which a number of very
successful planning (Sutton & Barto, 1998) and reinforcement
learning (Barto et al., 1983; Sutton, 1988;
Watkins, 1989) methods have been developed. However,
in many domains, e.g., in mobile robotics, and
in multi-agent or distributed control environments,
the agent’s sensors at best give it partial information
about the state of the environment. Such
agent-environment interactions suffer from hidden-state (Lin
& Mitchell, 1992) or perceptual aliasing (Whitehead
& Ballard, 1990; Chrisman, 1992) and can be formulated
as partially observable Markov decision processes,
or POMDPs (e.g., Sondik, 1978). Therefore, finding
efficient reinforcement learning methods for solving
interesting sub-classes of POMDPs is of great practical
interest to AI and engineering.

Recent research on POMDPs has concentrated on
overcoming partial observability by using memory to
estimate state (Chrisman, 1992; McCallum, 1993; Lin
& Mitchell, 1992) and on developing special purpose
planning and learning methods that work with the
agent’s state of knowledge, or belief state (Littman
et al., 1995). In part, this emphasis on state estimation
has come about because it has been widely
observed and noted that the presence of hidden state
renders popular and successful reinforcement learning
(RL) methods for MDPs, such as Q-learning (Watkins,
1989) and Sarsa (Rummery & Niranjan, 1994), useless
on POMDPs (e.g., Whitehead, 1992; Littman,
1994; Singh et al., 1994). However, this observation
is misleading in two ways: first, the theoretical results
(Singh et al., 1994; Littman, 1994) supporting it
only apply to RL algorithms that do not use eligibility
traces, and second, these results are worst-case results,
which leaves open the possibility that there may be
large classes of POMDPs in which existing RL algorithms
work well without any state estimation.

The main contribution of this paper is to show empirically
that Sarsa(λ), a well known family of reinforcement
learning algorithms that use eligibility traces,
can work very well on POMDPs that have good memoryless
policies, i.e., on problems in which there may
well be very poor observability but there also exists a
mapping from the agent’s immediate observations to
actions that yields near-optimal return. We also show
how this can be extended to low-order-memory-based
policies. This contribution is significant, because it
may be that most real-world engineering problems that
are well designed have good memoryless or good
low-order-memory-based policies. We apply conventional
Sarsa(λ) on four test problems taken from recent published
work on POMDPs and in each case show that
it is able to find the best, or a very good, memoryless
policy without any of the computational expense
of state estimation. However, these results have to
be interpreted with caution, for the problem of finding
optimal memoryless policies in POMDPs is known to
be computationally challenging (Littman, 1994); they
are evidence that Sarsa(λ) is at least competitive with
and at best better than other existing algorithms for
solving POMDPs when good low-order-memory-based
policies exist.

2  POMDP  Framework 

In this section we briefly describe the POMDP framework.
An environment is defined by a finite set of
states $S$, the agent has recourse to a finite set of actions
$A$, and the agent’s sensors provide it observations
from a finite set $X$. On executing action $a \in A$
in state $s \in S$ the agent receives expected reward $r_s^a$
and the environment transits to a random state $s' \in S$
with probability $P_{ss'}^a$. The probability of the agent
observing $x \in X$ given that the environment’s state is $s$
is $O(x \mid s)$. In the reinforcement learning (RL) problem
the agent does not know the transition and observation
probabilities $P$ and $O$ and its goal is to learn an action
selection strategy that maximizes the return, i.e.,
the expected discounted sum of rewards received over
an infinite horizon,
$E\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$,
where $0 \le \gamma < 1$
is the discount factor that makes immediate reward
more valuable than reward more distant in time, and
$r_t$ is the reward at time step $t$.
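
As a small worked illustration of the return defined above, the discounted sum for a finite reward sequence can be accumulated from the last reward backwards. The function name is hypothetical, and truncating the infinite horizon to a finite list of rewards is an assumption made only for this sketch.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * rewards[t] over a finite reward sequence."""
    g = 0.0
    # Work backwards so each step is a single multiply-add: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1 with gamma = 0.5 give 1 + 0.5 + 0.25.
example_return = discounted_return([1.0, 1.0, 1.0], 0.5)
```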

In fully observable RL problems or MDPs it is known
that there exists an optimal policy that is memoryless,
i.e., is a mapping from states to actions, $S \to A$.
RL algorithms such as Q-learning and Sarsa are able
to provably find such memoryless optimal policies in
MDPs. It is known that in POMDPs the best memoryless
policy can be arbitrarily suboptimal in the worst
case (Singh et al., 1994). We ask below if these same
RL algorithms can find the best memoryless policy
in POMDPs (Jaakkola et al., 1995; Littman, 1994),
regardless of how good or how bad it is; for if they
are able to find it, then they can at least be useful
in POMDPs with good memoryless policies. We note
that the success of RL algorithms when using compact
function approximation in fully observable problems
(Barto et al., 1983; Tesauro, 1995) provides some
evidence that this is possible because the use of compact
function approximation introduces hidden state
into otherwise completely observable MDPs.

3  Eligibility  Traces  and  Sarsa(λ)

In MDPs, reinforcement learning algorithms such as
Sarsa(λ) use experience to learn estimates of optimal
Q-value functions that map state-action pairs, $(s, a)$, to
the optimal return on taking action $a$ in state $s$. The
transition at time step $t$, $\langle s_t, a_t, r_t, s_{t+1} \rangle$, is used to
update the Q-value estimate of all state-action pairs
in proportion to their eligibility. The idea behind the
eligibilities is very simple. Each time a state-action
pair is visited it initiates a short-term memory or trace
that then decays over time (exponentially with parameter
$0 \le \lambda \le 1$). The magnitude of the trace determines
how eligible a state-action pair is for learning.
So state-action pairs visited more recently are more
eligible.

In POMDPs the transition information available to the
agent at time step $t$ is $\langle x_t, a_t, r_t, x_{t+1} \rangle$. A straightforward
way to extend RL algorithms to POMDPs
is to learn Q-value functions of observation-action
pairs, i.e., to simply treat the agent’s observations
as states. Below we describe standard Sarsa(λ) applied
to POMDPs in this manner. At step $t$ the
Q-value function is denoted $Q_t$ and the eligibility trace
function is denoted $\eta_t$. On experiencing transition
$\langle x_t, a_t, r_t, x_{t+1} \rangle$ the following updates are performed
in order:

$$\eta_t(x_t, a_t) = 1$$

$$\forall (x \ne x_t \text{ or } a \ne a_t): \quad \eta_t(x, a) = \gamma \lambda \, \eta_{t-1}(x, a)$$

$$\forall x \text{ and } a: \quad Q_{t+1}(x, a) = Q_t(x, a) + \alpha \, \delta_t \, \eta_t(x, a) \qquad (1)$$

where $\delta_t = r_t + \gamma Q_t(x_{t+1}, a_{t+1}) - Q_t(x_t, a_t)$ and $\alpha$
is the step-size. The eligibility traces are initialized
to zero, and in episodic tasks they are reinitialized
to zero after every episode. The greedy policy at
time step $t$ assigns to each observation $x$ the action
$a = \arg\max_b Q_t(x, b)$. Note that the greedy policy is
memoryless.
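
The update in (1) can be sketched as a single function over tabular dictionaries. This is an illustrative sketch under assumptions, not the authors' code: the function names, the dictionary-based tables, and the absence of special handling for terminal transitions are all choices made here for brevity.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, trace, x, a, r, x_next, a_next, alpha, gamma, lam):
    """Apply the updates of equation (1) for one transition <x, a, r, x_next>.

    Q and trace are dict-like tables keyed by (observation, action) that
    default to 0.0. Returns the TD error delta_t.
    """
    # Decay every existing trace by gamma * lambda ...
    for key in list(trace):
        trace[key] *= gamma * lam
    # ... then set the trace of the visited pair to 1 (a replacing trace).
    trace[(x, a)] = 1.0
    # TD error uses the Q-values from before this step's update.
    delta = r + gamma * Q[(x_next, a_next)] - Q[(x, a)]
    # Update all pairs in proportion to their eligibility.
    for key, eligibility in trace.items():
        Q[key] += alpha * delta * eligibility
    return delta


def greedy_action(Q, x, actions):
    """Memoryless greedy policy: argmax over actions of Q(x, b)."""
    return max(actions, key=lambda b: Q[(x, b)])


def new_tables():
    """Fresh Q-value and trace tables, both initialized to zero."""
    return defaultdict(float), defaultdict(float)
```

In an episodic task the trace table would be cleared at the start of each episode, matching the reinitialization described above, while `Q` is kept across episodes.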

3.1  Using  Sarsa(λ)  with  Observation
Histories

The Sarsa(λ) algorithm can also be easily used to develop
memory-based policies by simply learning a
Q-value function over estimated-states and actions, and
by keeping eligibility traces for estimated-state and action
pairs. So for example, we could augment the immediate
observation with the past K observations to
form the estimated-state and derive a memory-based
policy that maps K + 1 observations to actions. The
only change to the equations in (1) would be that the
immediate observations (x’s) would be replaced by the
estimated states.
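
One way to form such an estimated state is a fixed-length window over recent observations. The class below is a hypothetical sketch; in particular, padding with `None` at the start of an episode is an assumption, since the text does not say how short histories are handled.

```python
from collections import deque


class HistoryState:
    """Estimated state made of the current observation plus the past k observations."""

    def __init__(self, k: int):
        self.k = k
        self.buffer = deque(maxlen=k + 1)  # oldest observations fall off automatically

    def reset(self) -> None:
        """Clear the history, e.g. at the start of an episode."""
        self.buffer.clear()

    def update(self, observation):
        """Record a new observation and return the (k + 1)-tuple estimated state."""
        self.buffer.append(observation)
        padding = (None,) * (self.k + 1 - len(self.buffer))
        return padding + tuple(self.buffer)
```

The returned tuple is hashable, so it can be used directly as the key of a tabular Q-value or eligibility-trace dictionary in place of the immediate observation.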

4  Empirical  Results 

The Sarsa(λ) algorithm was applied in an identical
manner to four POMDP problems taken from the recent
literature and described below. Here we describe
the aspects of the empirical results common to all four
problems. At each step, the agent picked a random action
with a probability equal to the exploration rate,
and a greedy action otherwise. Except where explicitly
noted, we used an initial exploration rate of 20%,
decreasing linearly with each action (step) until the
200,000th action, from where onwards the exploration
rate was 0%. Q-values were initialized to 0. The agent
starts each episode in a problem-specific start state
or a randomly selected start state as specified by the
originators of the problems. Both the step-size (α) and
the λ values are held constant in each experiment. We
did a coarse search over α and λ for each problem but
present results only for λ = 0.9 and α = 0.01, which
gave about the best performance across all problems.
In all cases, a value of λ between 0.8 and 0.975 worked
the best. This is qualitatively similar to the results
obtained for MDPs, and a bit surprising given that
Sarsa(1) (or Monte-Carlo) has been recommended as
the way to deal with hidden state (Singh et al., 1994).
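
The exploration schedule described above might be written as follows. This is a sketch under assumptions: the function names are hypothetical, and the schedule is assumed to decay linearly from 20% at step 0 to exactly 0% at step 200,000.

```python
import random


def exploration_rate(step: int, initial: float = 0.20,
                     decay_steps: int = 200_000) -> float:
    """Linearly decaying exploration rate that is 0 from decay_steps onwards."""
    if step >= decay_steps:
        return 0.0
    return initial * (1.0 - step / decay_steps)


def choose_action(q_row, step: int, rng: random.Random) -> int:
    """Random action with probability exploration_rate(step), else a greedy one.

    q_row is the list of Q-values for the current observation, one per action.
    """
    if rng.random() < exploration_rate(step):
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)
```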

The data for the learning curves is generated as follows:
after every 1000 steps (actions) the greedy policy
is evaluated offline to generate a problem-specific
performance metric. All the learning-curves below are
plotted after smoothing this data by doing a running
average over 30 data points.
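
The smoothing step can be sketched as a trailing running average. How the first 29 points are treated is not stated in the text; averaging over however many points are available so far is an assumption of this sketch, as is the function name.

```python
from typing import List, Sequence


def running_average(values: Sequence[float], window: int = 30) -> List[float]:
    """Trailing running average; early points average over the points seen so far."""
    smoothed: List[float] = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]   # drop the value leaving the window
        smoothed.append(total / min(i + 1, window))
    return smoothed
```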

For  each  POMDP  we  first  present  its  structure  by 
defining  the  states,  actions,  rewards,  and  observations 
and  then  we  present  our  results. 

4.1  Sutton’s  Grid  World 

Sutton’s grid world problem (see Figure 1A) is from
Littman (1994) who took a navigation gridworld from
Sutton (1990) and made it a POMDP by not allowing
its exact position to be known to the agent.

States: This POMDP is a 9 by 6 grid with several
obstacles and a goal in the upper right corner (see Figure
1A). The state of the environment is determined by
the grid square the agent occupies. State transitions
are deterministic.

Actions: The agent can choose one of 4 actions: move
north, move south, move east, and move west.

Observations: The agent can observe its 8 neighboring
grid squares yielding 256 possible observations.
Only 30 (of the 256 possible) unique observations occur
in the gridworld. Observations are deterministic.
Figure 1A shows the gridworld with observations indicated
by the number in the lower right corner of each
square.
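
To see where the 256 possible observations come from, each of the 8 neighboring squares can be treated as one bit (blocked or free). The encoding below is a hypothetical sketch: the bit order, the grid representation, and the treatment of squares outside the grid as blocked are assumptions, and the resulting integers are not the observation numbers used in Figure 1A.

```python
from typing import List

# Offsets (row, column) of the 8 neighbors, one per bit of the observation.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def observe(grid: List[str], row: int, col: int) -> int:
    """Encode which of the 8 neighboring squares are blocked as an integer in 0..255.

    grid is a list of equal-length strings in which '#' marks an obstacle.
    """
    observation = 0
    for bit, (dr, dc) in enumerate(NEIGHBOR_OFFSETS):
        r, c = row + dr, col + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not inside or grid[r][c] == "#":
            observation |= 1 << bit
    return observation
```

Distinct squares with the same pattern of blocked neighbors receive the same integer, which is exactly the perceptual aliasing the agent has to cope with.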

Rewards: The agent receives a reward of -1 for each
action that does not transition to the goal state. A
reward of 0 is received for any action leading to the
goal state.

When  the  agent  reaches  the  goal  state  it  transitions  to 
a  uniformly  random  start  state. 

4.1.1  Sarsa(λ)  Results

After every 1000 steps of experience in the world, the
greedy policy is evaluated to determine the total number
of steps required to reach the goal from every possible
non-goal start state (46 start states). The agent is
limited to a maximum of 1000 steps to reach the goal.
Thus a policy which cannot reach the goal from any
start state would have a total steps to goal of 46,000.

Sarsa(λ) converged to the 416 total step policy shown
with arrows in Figure 1A; the learning-curve is shown
in Figure 1B. The total steps to the goal for the optimal
policy in the underlying MDP is 404, and so in this
case a very good memoryless policy was found. This
416 step policy matches exactly with the 416 step policy
Littman (1994) found using an expensive branch


326  Loch  and  Singh 


[Figure 1 graphics omitted. Panel B title: Sutton's Gridworld, lambda = 0.9, alpha = 0.01]


Figure 1: Sutton’s Grid World (from Littman, 1994).
A) The grid world environment. The numbers on the
lower right are the observations. The arrows show the
optimal memoryless policy found by Sarsa. B) The
total steps to goal of the greedy policy as a function
of the number of learning steps. The inset plot shows
the same data at a different scale.


and bound method that searches directly in memoryless
policy space and is guaranteed to find the optimal
memoryless policy. Note that the number of possible
memoryless policies is (4 actions, 30 observations)
$4^{30} \approx 1.2 \times 10^{18}$ policies.
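
The count follows from choosing one of the available actions independently for each observation. A one-line check of the arithmetic, with a hypothetical function name:

```python
def num_memoryless_policies(num_actions: int, num_observations: int) -> int:
    """Number of deterministic mappings from observations to actions."""
    return num_actions ** num_observations


# Sutton's grid world: 4 actions and 30 distinct observations.
sutton_count = num_memoryless_policies(4, 30)
```

Python integers have arbitrary precision, so the exact value (1,152,921,504,606,846,976, about 1.2 × 10^18) is obtained without overflow.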

Observe that in Figure 1A the agent learns to go left
in the state just to the left of the goal. This is because
it has to go up in the state immediately below (observation
18) because of its aliasing with the state 4 steps
below the goal (both states have 3 walls to the right).

4.2  Littman,  Cassandra,  and  Kaelbling’s  89 
State  Office  World 

States: The gridworld for Littman et al.’s (1995) 89
state office problem is shown in Figure 2A. The state of
the environment is the combination of the grid square
that the agent is occupying and the direction that the
agent is facing (N, S, E, W). There are 22 possible agent
locations times 4 directions for 88 states plus the goal
state for 89 total states. State transitions are stochastic.

Actions: The agent can choose one of 5 actions: stay
in place, move forward, turn right, turn left, and turn
around. Both the state transitions and the observations
are noisy with the agent getting the correct observation
only 70% of the time.

Observations: The agent can observe the relative
position of obstacles in 4 directions: front, back, left
and right. There are 16 possible observations plus the
goal observation.

Rewards: The agent receives a reward of +1 for any
action leading to the goal observation with all other
rewards equal to 0.

After reaching the goal observation the agent transitions
to a uniformly random start state.

4.2.1  Sarsa(λ)  Results

After every 1000 steps of experience the greedy policy
is evaluated. As in Littman et al., for each evaluation
251 trials are run using the greedy policy with a
maximum step cutoff at 251 steps. Two performance
metrics are used: the median number of steps to the
goal for the 251 trials, and the percent of the 251 trials
which reach the goal state within 251 steps.
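
The two metrics could be computed from per-trial outcomes roughly as follows. The function name and the input format are hypothetical, and counting a failed trial as one step more than the cutoff when taking the median is an assumption; the text only says that such medians are reported as greater than 251.

```python
import statistics
from typing import List, Optional, Tuple


def summarize_trials(steps_to_goal: List[Optional[int]],
                     cutoff: int = 251) -> Tuple[float, float]:
    """Return (percent of trials reaching the goal, median steps to goal).

    steps_to_goal holds the step count for each trial, or None when the goal
    was not reached within the cutoff.
    """
    reached = [s for s in steps_to_goal if s is not None]
    percent_success = 100.0 * len(reached) / len(steps_to_goal)
    # Failed trials are treated as longer than the cutoff for the median.
    capped = [s if s is not None else cutoff + 1 for s in steps_to_goal]
    return percent_success, statistics.median(capped)
```

With this convention, a policy that fails on more than half of its trials has a median above the cutoff, which is how a median of more than 251 steps can arise.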

The best memoryless policy found by Sarsa(0.9) was
able to reach the goal in 77% of the 251 trials on average
(see Figure 2B) with a median number of steps
to goal of 73 steps (see Figure 2C). The best policy
[Figure 2 graphics omitted. Panel titles: B) Littman’s 89 state office problem, lambda = 0.9, alpha = 0.01; C) Littman’s 89 state office problem, lambda = 0.9, alpha = 0.01]

Figure  2:  Littman  et  al.’s  89  state  office  world.  A) 
The  office  world  environment  where  the  goal  state  is 
denoted  with  a  star.  The  state  of  the  environment  is 
the  combination  of  the  grid  square  that  the  agent  is 
occupying  and  the  direction  that  the  agent  is  facing  (N, 
S,  E,  W).  B)  The  percentage  of  trials  with  the  greedy 
policy  that  succeed  in  getting  to  the  goal  in  less  than 
251  steps.  C)  Median  number  of  steps  to  goal  of  the 
greedy  policy  as  a  function  of  the  number  of  learning 
steps. 


found by Sarsa(0.9) outperformed all of the memory-based
policies found by Littman et al. in their Table
3. Their best policy was able to reach the goal in only
44.6% of the 251 trials, with a median of more than
251 steps to goal, and was found using a truncated value
iteration algorithm on belief states. Littman et al. also
presented a hybrid method that finds a policy that
reached the goal in 58.6% of the trials (still below the
percentage for the best memoryless policy found by
Sarsa(0.9)) with a median of 51 steps to goal (this
is better than Sarsa(0.9)’s 73 steps).

There are $5^{16} \approx 1.53 \times 10^{11}$ possible memoryless policies
for this problem. Therefore it is not practical to
enumerate the performance of every possible policy to
verify whether the policy found by Sarsa(0.9) is indeed the
optimal memoryless policy, but its performance vis-a-vis
the state-estimation based methods of Littman et
al. was encouraging.

4.3  Parr  and  Russell’s  Grid  World 

States: Parr and Russell’s (1995) gridworld consists
of 11 states in a 4 by 3 grid with a single obstacle
(see Figure 3A). The state of the environment is determined
by the grid square occupied by the agent.

Actions: The agent can choose one of 4 actions: move
north, move south, move east, and move west. State
transitions are stochastic with the agent moving in the
desired direction 80% of the time and slipping to either
side 10% of the time.
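
The slip model can be sketched as sampling the direction actually moved. Reading "slipping to either side 10% of the time" as 10% to each side (so that the probabilities sum to 1) is an assumption of this sketch, as are the names used.

```python
import random

# For each intended direction, the two perpendicular "slip" directions.
SIDES = {
    "north": ("west", "east"),
    "south": ("east", "west"),
    "east": ("north", "south"),
    "west": ("south", "north"),
}


def sample_direction(action: str, rng: random.Random) -> str:
    """Return the direction actually moved: intended 80%, each side 10%."""
    u = rng.random()
    if u < 0.8:
        return action
    left, right = SIDES[action]
    return left if u < 0.9 else right
```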

Observations: The agent can only observe if there
is a wall to its immediate left or right. There are 4
possible observations corresponding to the combinations
of left and right obstacles plus two observations
for the goal and penalty states yielding a total of 6
observations. Observations are deterministic.

Rewards: There is a goal state in the upper right corner
with a penalty state directly below the goal state.
The agent receives a reward of -0.04 for every action
which does not lead to the goal or penalty state. The
agent receives a reward of +1 for any action leading
to the goal state and a reward of -1 for any action
leading to the penalty state.

4.3.1  Sarsa(λ)  Results

Every 1000 steps the greedy policy was evaluated and
the learning curve is presented in Figure 3B. The average
reward per step was computed for 101 trials of
up to 101 steps per trial. There are $4^6 = 4096$ possible
memoryless policies for this problem. We verified
that Sarsa(0.9) found the optimal memoryless policy
by evaluating the performance of all 4096 possible
policies. In this problem, the best memoryless policy
is rather poor compared to policies which use memory.
The best memoryless policy yields an average reward
per step of 0.024 compared to the memory-based
policy found by the Witness algorithm (Littman et
al., 1995) which yields an average reward per step of
0.1108.

Figure 3: Parr & Russell’s Grid World. A) The gridworld
environment. The numbers in the lower right
are the observations. The arrows show the optimal
memoryless policy found by Sarsa. B) The average reward
per action of the memoryless greedy policy as a
function of the number of learning steps.

Figure 4: Parr & Russell’s Grid World. A) Performance
of the greedy policy when one past observation is added
to the immediate observation (panel title: Parr and
Russell’s 4x3 maze: 2 observations, lambda = 0.9,
alpha = 0.001). B) Performance of the greedy policy
when two past observations are added to the immediate
observation (panel title: Parr and Russell’s 4x3 maze:
3 observations, lambda = 0.9, alpha = 0.001). Note the
different scales.

Parr & Russell’s SPOVA-RL (Smooth Partially Observable
Value Approximation Reinforcement Learning)
algorithm learns a value function over belief states
and did even better yielding an average reward per
step of 0.12 with a memory-based policy.¹

The poor relative performance of the optimal memoryless
policy is due to the non-optimal actions the agent
must take in the aliased states. For example, observation
0 (see Figure 3A) is observed for 3 states in the
grid. The state to the left of the penalty state is observed
as observation 0 and causes the optimal action
in observation 0 to be move north instead of move east.
This causes the agent to continuously bump into the
upper left corner wall until the transition noise causes
a transition to the state to the east.

We investigated the performance improvement obtained
by Sarsa(λ) when the immediate observation
is augmented with 1 and with 2 previous observations.
The policy using 1 previous observation
yielded an average reward per step of 0.1124
(see Figure 4A) which is better than the policy found
by the Witness algorithm and almost as good as the
policy found by SPOVA-RL. Sarsa(λ) required fewer
than 60 CPU seconds to find its policy compared to
the 42 CPU minutes for SPOVA-RL and the 12 CPU
hours required by the Witness algorithm (Parr & Russell,
1995). The 3-observation performance is shown
in Figure 4B and is the same as the 2-observation
performance.

We were able to verify that the policy found by
Sarsa(λ) using 1 previous observation was indeed the
optimal policy in that space. Only ten 2-observation
sequences are encountered in the gridworld leading to
$4^{10}$ = 1,048,576 possible 2-observation policies. We
evaluated the performance of all possible 2-observation
policies and again verified that the policy found by
Sarsa(λ) was the same as the best 2-observation policy.

¹ Parr and Russell state that their implementation of the
Witness algorithm did not converge on this problem, which
probably accounts for the better performance of SPOVA-RL
relative to the exact Witness algorithm.
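
An exhaustive check of this kind can be sketched with `itertools.product`. The function below is illustrative only: `evaluate` stands in for whatever routine measures a policy's average reward per step and is a hypothetical callable supplied by the caller, and the toy scoring function in the usage lines is invented for the example.

```python
from itertools import product
from typing import Callable, Dict, Hashable, Sequence, Tuple


def best_policy_by_enumeration(
    inputs: Sequence[Hashable],
    actions: Sequence[Hashable],
    evaluate: Callable[[Dict[Hashable, Hashable]], float],
) -> Tuple[Dict[Hashable, Hashable], float]:
    """Score every mapping from inputs to actions and return the best one.

    inputs may be single observations or observation sequences. The search
    visits len(actions) ** len(inputs) policies, so it is feasible only for
    small problems.
    """
    best_policy, best_score = None, float("-inf")
    for choice in product(actions, repeat=len(inputs)):
        policy = dict(zip(inputs, choice))
        score = evaluate(policy)
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score


# Toy usage: the score counts agreements with a made-up target policy.
toy_target = {0: "north", 1: "east", 2: "east"}
toy_best, toy_score = best_policy_by_enumeration(
    inputs=[0, 1, 2],
    actions=["north", "south", "east", "west"],
    evaluate=lambda p: sum(p[x] == toy_target[x] for x in p),
)
```

With ten observation sequences and 4 actions the loop visits 1,048,576 policies, which is manageable; the same approach is clearly out of reach for the larger problems in this paper.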

4.4  Chrisman’s  Shuttle  Problem 

States:  Chrisman’s  (1992)  shuttle  problem  involves 
an  agent  operating  in  an  environment  with  8  states, 
3  actions,  and  5  observations.  The  scenario  consists 
of  two  space  stations  with  loading  docks.  The  task  is 
to  transport  supplies  between  the  two  docks.  There  is 
noise  in  both  the  state  transitions  and  observations. 

Actions: The agent can execute one of 3 actions: go forward, back
up, and turn around.

Observations:  The  5  observations  are:  can  see  the 
least  recently  visited  (LRV)  station;  can  see  the  most 
recently  visited  (MRV)  station;  can  see  that  we  are 
docked  in  most  recently  visited  (MRV)  station;  can 
see  that  we  are  docked  in  least  recently  visited  (LRV) 
station;  and  can  see  nothing.  There  is  sensor  noise 
causing  the  agent  to  make  faulty  observations. 

Rewards: The agent receives a reward of +10 when it docks with the
least recently visited station. The agent must back into the dock to
dock with the station. If the agent collides with the station by
moving forward, it receives a reward of -3. All other action rewards
are 0.

4.4.1  Sarsa(λ)  Results

Every 1000 steps (actions) the performance of the greedy policy is
evaluated. The performance metric is the average reward per step for
101 trials of up to 101 steps (actions) each. With 3 actions and 5
observations there are 3^5 = 243 possible memoryless policies.
Sarsa(0.9) finds a memoryless policy which yields an average reward
per step of 1.02 (see Figure 5A for the learning curve). We verified
that the policy found by Sarsa(0.9) was indeed the optimal
memoryless policy by evaluating the performance of all 243 possible
memoryless policies.
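The count of 243 comes from assigning one of 3 actions to each of 5 observations. A minimal sketch of such an exhaustive enumeration follows; the label strings and the caller-supplied `evaluate` function are hypothetical, since the paper does not give its evaluation code.

```python
from itertools import product

# Illustrative labels only; the paper names the actions and observations in prose.
ACTIONS = ("forward", "backup", "turn_around")
OBSERVATIONS = ("see_LRV", "see_MRV", "docked_MRV", "docked_LRV", "nothing")

def memoryless_policies(observations=OBSERVATIONS, actions=ACTIONS):
    """Yield every deterministic mapping from observation to action."""
    for choice in product(actions, repeat=len(observations)):
        yield dict(zip(observations, choice))

def best_policy(evaluate, policies):
    """Return the policy with the highest score under a caller-supplied evaluator."""
    return max(policies, key=evaluate)

n_policies = sum(1 for _ in memoryless_policies())
print(n_policies)  # 3**5
```

In practice `evaluate` would run the simulated shuttle environment and return the average reward per step; that simulator is not reproduced here.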

The two best memory-based policies for Chrisman’s shuttle problem
found by Littman et al. (1995) were found through truncated exact
value iteration and their Qmdp method. Truncated exact value
iteration found a policy with an average reward per step of 1.805
while Qmdp yielded 1.809. The performance of the optimal memoryless
policy is rather poor compared to the performance of policies using
memory. This is due to the conservative nature of the optimal
memoryless policy, which avoids any forward actions so as to avoid
receiving the -3 penalty for hitting the station while moving
forward.
[Figure 5, panel A: Chrisman's Shuttle Problem, memoryless; lambda = 0.9, alpha = 0.01]


We also investigated the performance improvement obtained by
augmenting the current observation with 1 and 2 previous
observations. By including the previous observation, the performance
improved by 37% to an average reward per step of 1.37 (see Figure
5B). By including the 2 previous observations, the performance
improved by 80% to an average reward per step of 1.804 (see Figure
5C). The performance of the best policy found by Sarsa with 2
previous observations is as good as the truncated exact value
iteration method and the Qmdp method, again at a much lower
computational cost.

4.5  Discussion 


Figure 5: Chrisman’s shuttle problem (lambda = 0.9, alpha = 0.01 in
all panels). A) Memoryless: the average reward per action of the
memoryless greedy policy as a function of the number of learning
steps. B) 2 observations: the performance of the greedy policy when
one past observation is added to the immediate observation. C) 3
observations: the performance of the greedy policy when two past
observations are added to the immediate observation.


In all the empirical results presented above, either we were able to
confirm by enumeration that Sarsa found the best policy
representable as a mapping from estimated states (immediate, or
immediate and past 1 or past 2 observations) to actions, or, in
cases where it was not possible to enumerate, we observed that Sarsa
did as well as the algorithms presented by the originators of the
specific POMDPs. Speculating from these empirical results, we
conjecture that Sarsa(λ) may be hard to beat in problems where there
exists a good policy that maps the observation space to actions.

4.5.1  Why  do  Eligibility  Traces  Work? 

Consider the set of states that map onto the same observation X. The
neighbours of this set of states for some action a may map to
several different observations. This can lead to conflicting pulls
on the Q-value of (X, a) depending on which state is providing the
experience; some may suggest that a is good, some may suggest that a
is bad. However, these different pulls could be resolved if we
considered what happens after n steps. Indeed, if we wait until we
get to the goal (Monte Carlo or Sarsa(1)) there would be no
confusion due to the hidden state at all. Eligibility traces allow
an observation-action pair to access what happens many time steps
later, bridging the gap to unambiguous information about the quality
of an action. This reasoning indicates that there may be a minimum
problem-specific λ that would be needed to bridge the smallest such
“gap” in each problem. Our observations during the current work
support this; however, a careful analysis remains as future work.
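The bridging effect can be illustrated with a small tabular sketch of a Sarsa(λ) update indexed by (observation, action) pairs. This is not the authors' code: the accumulating-trace variant, the discount factor `gamma`, and the two-step toy episode are assumptions made for illustration.

```python
from collections import defaultdict

def sarsa_lambda_step(Q, E, obs, act, reward, next_obs, next_act,
                      alpha=0.01, gamma=0.95, lam=0.9, terminal=False):
    """One tabular Sarsa(lambda) update with accumulating eligibility traces.

    Q and E are dicts keyed by (observation, action), not by hidden state.
    gamma and the accumulating-trace choice are assumptions of this sketch.
    """
    target = reward if terminal else reward + gamma * Q[(next_obs, next_act)]
    delta = target - Q[(obs, act)]
    E[(obs, act)] += 1.0                    # mark the pair just visited
    for key in list(E):
        Q[key] += alpha * delta * E[key]    # credit every recently visited pair
        E[key] *= gamma * lam               # decay all traces
    return delta

# Toy episode: (X, a) yields no reward, then (Y, b) reaches the goal.
Q = defaultdict(float)
E = defaultdict(float)
sarsa_lambda_step(Q, E, "X", "a", 0.0, "Y", "b", alpha=0.5, gamma=1.0, lam=0.9)
sarsa_lambda_step(Q, E, "Y", "b", 10.0, None, None,
                  alpha=0.5, gamma=1.0, lam=0.9, terminal=True)
```

With λ = 0 the first pair would receive no credit in this episode; with λ = 0.9 the goal reward reaches (X, a) immediately through its decayed trace.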

5  Conclusion 

Partial observability is inevitable in many sequential decision
problems of interest to both AI and engineering. Given the
worst-case computational intractability of POMDPs, it is useful to
identify sub-classes of POMDPs and algorithms that work well in
them. We believe that eligibility-trace-based RL methods such as
Sarsa(λ) can be useful in POMDPs that have good memoryless or good
low-order-memory-based policies. We demonstrated this empirically on
four POMDP problems from the recent literature. A more powerful
result, which remains future work, would be to prove this
theoretically.
Abstract

This paper addresses the problem of determining an object’s 3D
location from a sequence of camera images recorded by a mobile
robot. The approach presented here allows people to “train” robots
to recognize specific objects by presenting them with examples of
the object to be recognized. A decision tree method is used to learn
significant features of the target object from individual camera
images. Individual estimates are integrated over time, using Bayes
rule, into a probabilistic 3D model of the robot’s environment.
Experimental results illustrate that the method enables a mobile
robot to robustly estimate the 3D location of objects from multiple
camera images.


1  INTRODUCTION 

In recent years, there has been significant progress in the field of
mobile robotics. Applications such as robots that guide blind or
mentally handicapped people, robots that clean large office
buildings and department stores, robots that assist people in
recreational activities, etc., are slowly coming within reach. Many
of these robots must integrate mobility with manipulation: they must
be able to move around, and they must also be capable of
manipulating their environment. For such robots, practical success
will partially depend on their ability to identify and localize
objects.

This paper addresses the problem of building robots that can be
trained to recognize and locate user-specified objects. More
specifically, it proposes an algorithm that enables people to train
robots by simply showing a few poses of the object. Once trained,
the robot can recognize these objects and determine their location
in 3D space. In contrast to existing approaches to mobile
manipulation, which usually assume that objects are located at floor
or table height, our approach does not make restrictive assumptions
as to where the object is located. This poses new challenges for the
ability to localize objects, as a single camera image is
insufficient to determine the location of an object in 3D space.

The approach proposed here uses probabilistic representations to
estimate the identity and location of the target object from
multiple views. It maps camera images into 2D probabilistic maps,
which describe, for each pixel in the camera image, the likelihood
that this pixel is part of the target object. This mapping is
established by a decision tree applied to local image features,
which is constructed during the training phase from labeled images.
The 2D probabilistic map is then projected into the 3D work space,
based on straightforward geometric considerations. Since a single
camera image is insufficient to determine the location of an object
in 3D, our approach integrates information from multiple images,
taken from multiple viewpoints. It employs Bayes rule to generate a
consistent probabilistic 3D model of the workspace. Our approach
also takes into account the uncertainty introduced by robot motion,
by using a probabilistic model of robot motion. As the robot moves
in the environment taking images, it gradually improves the
estimation of the identity and location of an object, until it
finally knows what and where the object is. Experimental results
using an RWI B21 robot equipped with a color camera show that
multi-part objects can be located robustly and with high accuracy.

The remainder of this paper is organized as follows. In section 2,
we briefly describe decision trees along with the way our approach
uses them for characterizing images. In section 3, we show how image
information is integrated into a 3D model, and provide a method for
accommodating the uncertainty that is introduced by robot motion. In
section 4, we present experimental results, obtained with an RWI B21
robot, followed by a survey of related research (section 5).
Finally, in section 6, we comment on the assumptions and limitations
of the approach and suggest directions for future research.

Figure 1: The top few nodes of an example decision tree. A leaf
represents the probability conditioned on the values of the
attributes. Internal nodes test on the fraction of positive pixels
of a tile that fall in the corresponding hue range.

2  DECISION  TREE  LEARNING 

A decision tree is a succinct and explicit way of representing a
multidimensional discrete-valued function
$f : \mathbb{R}^n \times X^m \to Y$, where $X$ and $Y$ are finite
sets of discrete elements and $\mathbb{R}$ is the set of real
numbers. The (n + m) inputs to this function frequently correspond
to discrete and/or continuous-valued attributes of an object and the
output represents an object’s property that we want to predict. Each
node of the tree is associated with a partition of the input space.
An internal node further partitions its space into two subspaces
based on the value of a single input variable, associating each of
the resultant subspaces to each of the two children. The set of
decision trees is complete in the space of discrete-valued
functions, i.e. any such function can be represented by at least one
decision tree. An example of a decision tree, obtained in the
context of image analysis in a fashion similar to the one used in
this paper (see below), is illustrated in Fig. 1.

Our approach uses decision trees to approximate conditional
probability density functions. Decision trees are usually used to
answer YES/NO queries regarding the output value of f given an input
tuple of values. If, for example, f is a boolean-output function,
querying is typically done by comparing the number of positive and
the number of negative training examples that were assigned during
training to the leaf node that is associated with the partition that
the input tuple lies in. The algorithm would then return the value
(YES/NO) that is in majority in that leaf. In our use of a decision
tree we differ in that we instead output the fraction of positive or
negative examples found in the leaf. As such, we use the decision
tree to represent an approximation of the probability density
function on the output space conditioned on the values of attributes
in the input space of f. If appropriately pruned (during a
post-pruning phase that is intended to increase compactness and,
more importantly, generalization over future data), these
probabilities are usually not zero or one because of training set
noise in either the values of the inputs or the output, or
non-determinism due to use of a set of input variables that is
insufficient to deterministically model f.
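The difference between the conventional majority answer and the fractional output described above can be shown in a few lines. The function names are hypothetical and the sketch covers only the leaf readout, not tree growing or pruning.

```python
def leaf_probability(n_positive, n_negative):
    """Fraction of positive training examples that reached a leaf."""
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("leaf received no training examples")
    return n_positive / total

def majority_label(n_positive, n_negative):
    """Conventional YES/NO readout: True if positives are in the majority."""
    return n_positive > n_negative

# A leaf that received 3 positive and 1 negative training examples.
p = leaf_probability(3, 1)
label = majority_label(3, 1)
```

The fractional readout keeps the information that a quarter of the examples in this leaf were negative, which the YES/NO readout discards.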

2.1  A  PDF  FOR  CHARACTERIZING  AN  IMAGE 

Our approach uses a decision tree to map (filter) camera images into
2D probabilistic maps, which describe the probability of the
presence of a target object at the various locations in the image.
More specifically, the inputs to the tree are image features in a
local region of the image (called a tile), and the output is a
probability value that measures the likelihood of the presence of a
target object in the respective tile. In principle, our approach can
be applied to arbitrary image features (e.g., pixels, edges,
brightness, color, texture, etc.). In our implementation, local
color histograms are used as input to the decision tree.

The  tree  is  learned  using  labeled  training  examples.  More 
specifically,  construction  of  the  training,  test  and  pruning 
sets  is  done  using  the  following  procedure: 

1.  An  input  picture  is  obtained. 

2.  A  rectangle  R  is  drawn  around  the  object  by  the  user. 
This  might  include  parts  of  the  background. 

3.  The image is divided into a grid of non-overlapping
rectangular tiles, completely covering its surface. The
size of each tile is small relative to the projection of
the object on the image. A tile size of 8 x 8 is used in this paper.

4.  Each  tile  is  used  to  construct  a  single  positive  or  nega¬ 
tive  example.  The  features  that  occur  in  the  tile,  which 
can  be  continuous  or  discrete,  are  extracted  and  used 
as  input  values  for  the  example  associated  with  that 
tile. 

5.  Depending  on  whether  each  tile  is  fully  contained 
within  R  or  not,  the  example  is  assigned  to  be  posi¬ 
tive  or  negative,  respectively. 
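The labeling rule in steps 3 and 5 can be sketched as follows. The function name, the pixel-coordinate convention for R (upper bounds exclusive), the assumption that the 8 x 8 tile size is in pixels, and the 256-wide by 240-high orientation are all assumptions of this sketch; feature extraction (step 4) is omitted.

```python
def label_tiles(width, height, rect, tile=8):
    """Return {(col, row): is_positive} for non-overlapping square tiles.

    rect = (x0, y0, x1, y1) is the user-drawn rectangle R in pixel
    coordinates, with x1 and y1 exclusive. A tile is positive only if it
    lies fully inside R. Assumes width and height are multiples of tile.
    """
    x0, y0, x1, y1 = rect
    labels = {}
    for row in range(height // tile):
        for col in range(width // tile):
            left, top = col * tile, row * tile
            inside = (left >= x0 and top >= y0 and
                      left + tile <= x1 and top + tile <= y1)
            labels[(col, row)] = inside
    return labels

labels = label_tiles(width=256, height=240, rect=(20, 10, 60, 50))
n_positive = sum(labels.values())
```

Tiles that straddle the border of R become negative examples under this rule, so a generously drawn rectangle mainly adds background to the positive set rather than object to the negative set.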

This  set  of  examples  is  equally  divided  into  training,  test 
and  pruning  sets,  and  these  are  used  in  growing  a  decision 
tree  for  that  combination  of  object  and  environment  that  it 
was  seen  in.  The  resulting  tree,  when  applied  to  new  im¬ 
ages  within  that  environment,  provides  probability  densi¬ 
ties  for  the  presence  of  a  target  object. 

Figure 2 illustrates our method. Shown there, in the top row, is a
series of three training images. The target object is labeled by
hand. The bottom row shows a test image, along with the probability
field generated by the tree. As can be seen there, the algorithm
assigned high likelihood to the correct location, but also
misclassified a small number of regions in the image background.
From this single camera image, it is impossible to determine the
location of the target object in 3D coordinates. The remainder of
this paper describes our approach to integrating these probabilistic
estimates in 3D space.

Figure 2: Detection of a bottle from previous examples. The top row
contains images where the outlined part contains the tiles used as
positive examples. The rest of the image’s tiles are negative
examples. Probabilities above 0.8 are marked in the previously
unseen picture in the bottom row. Not shown is another set of 18
“background” pictures consisting of negative examples only.

3  INTEGRATING  MULTIPLE  CAMERA 
IMAGES  IN  3D 

Our approach integrates the probabilistic information, extracted
from individual images, into a spatial 3D model of the world.
Information about the location of the object is represented as a 3D
occupancy grid. Each grid cell is associated with an approximation
of the probability that part of the object occupies that particular
cell. Each such probability is initialized with a number that
corresponds to a prior belief that the object occupies a cell given
no information about the world. This number can be learned from
data, typically through counting according to the frequentist
approach to probability. The exact value of the prior is not
significant in the long term, since the value will converge towards
the actual probability after a sufficient number of observations.
However, if there is evidence that the object in question occurs
more frequently in certain areas (for example, a shoe may be
expected to lie on the floor most of the time), this information can
be used to appropriately initialize prior probabilities and assign
higher values to these locations. During detection, each
information-gathering step is followed by an updating of the
probability of each cell according to Bayes law, as described below.
Robot motion also affects the grid due to uncertainty of the robot’s
translational and rotational velocities.

3.1  INFORMATION  INTEGRATION 

The key idea for mapping 2D image information into a 3D spatial
representation is to map image tiles into pyramids in space. Each
image obtained from the environment provides us with information
about the location of parts of the object. Since we assume a single
camera input, we have no information about the depth of features
contained in one of the tiles of the image. We therefore make no
assumption on the distance of the part from the eyepoint. However,
we do obtain information about the Euler angles (azimuth θ and
altitude φ) of the feature with respect to the robot’s current
location. In particular we know that it is contained within the
pyramid emanating from the eyepoint whose four converging sides
intersect the four corners of the tile on the image plane that is
perpendicular to the direction the camera is facing. Grid cells
intersecting this pyramid are therefore updated using Bayes law.

An example of the updating is shown in Fig. 3. Here two different
pyramids are shown (projected into the x-y plane), which have been
generated from camera images taken at different locations. Bayes
rule is applied to integrate these pyramids, in order to generate a
single, consistent belief.

The integration works as follows. The probability that a part of the
object occupies the cell at grid location $(x, y, z)$ at time $t$ is
denoted by $\Pr[\xi(x, y, z, t)]$. Coordinates $x$, $y$ and $z$ are
with respect to a fixed, world-centered coordinate system (they are
not local robot-centered coordinates). $\xi(x, y, z, t)$ is a
boolean random variable denoting the existence of a part of the
object at a location somewhere inside the corresponding grid cell.
In the following we will use $\xi$ instead of $\xi(x, y, z, t)$ for
the sake of brevity. If $i(t)$ denotes the image obtained at time
$t$ and $D(t-1)$ the set of previous images/motion commands in all
previous steps, the probability value $p(\xi)$ at grid cell location
$\xi$ is computed as follows:

$$p(\xi) = \Pr[\xi \mid i(t), D(t-1)] = \frac{\Pr[i(t) \mid \xi, D(t-1)] \, \Pr[\xi \mid D(t-1)]}{\Pr[i(t) \mid D(t-1)]}.$$

$\Pr[\xi \mid D(t-1)]$ is the prior probability accumulated in the
cell from previous iterations of the procedure, which takes into
account all previous data.
$\Pr[i(t) \mid \xi, D(t-1)] = \Pr[i(t) \mid \xi]$ by making a Markov
conditional independence assumption that implies that, given the
fact of the existence or not of part of the object in the cell, the
image obtained does not depend on previous images. Under this
assumption, by using

$$\Pr[i(t) \mid \xi] = \frac{\Pr[\xi \mid i(t)] \, \Pr[i(t)]}{\Pr[\xi]}$$

we obtain

$$p(\xi) = \frac{\Pr[\xi \mid i(t)] \, \Pr[i(t)] \, \Pr[\xi \mid D(t-1)]}{\Pr[\xi] \, \Pr[i(t) \mid D(t-1)]}, \qquad p(\bar{\xi}) = \frac{(1 - \Pr[\xi \mid i(t)]) \, \Pr[i(t)] \, (1 - \Pr[\xi \mid D(t-1)])}{(1 - \Pr[\xi]) \, \Pr[i(t) \mid D(t-1)]},$$

where $\bar{\xi}$ is the complement of event $\xi$.
$\Pr[\xi \mid i(t)]$ is the probability estimate returned by the
decision tree for the tile corresponding to the cell at $(x, y, z)$
by only taking the current image into account. In estimation
problems of this type, it is common practice to compute the
odds-ratio, for which $\Pr[i(t)]$ and $\Pr[i(t) \mid D(t-1)]$ cancel
out:

$$\text{odds-ratio}(\xi) = \frac{p(\xi)}{p(\bar{\xi})} = \frac{\Pr[\xi \mid i(t)]}{1 - \Pr[\xi \mid i(t)]} \cdot \frac{\Pr[\xi \mid D(t-1)]}{1 - \Pr[\xi \mid D(t-1)]} \cdot \frac{1 - \Pr[\xi]}{\Pr[\xi]}$$

$$p(\xi) = \frac{\text{odds-ratio}(\xi)}{1 + \text{odds-ratio}(\xi)}$$

Similar formulas for belief integration can be found in
[Pea88, Thr98b].

Figure 3: On the top, the robot used in our experiments is shown. It
is equipped with a parallel two-fingered gripper for object
manipulation. On the bottom, an illustration is presented of how
information from images taken from two different viewpoints is
integrated in the occupancy grid. Shown to the left are two single
projections applied to an “empty” grid. The picture on the right
shows how they are combined together. The images depict the average
values of grid cell probabilities when viewed from above (i.e.
averaging probability values along the z-axis).


3.2  ROBOT  MOTION 

Each  robot  motion  introduces  uncertainty  into  the  robot’s 
estimate  of  the  object’s  location  because  of  imperfect  actu¬ 
ators  and  measuring  devices.  We  model  the  translational  as 
well  as  rotational  magnitude  of  the  velocity  of  the  robot  as  a 
Gaussian  random  variable  with  mean  equal  to  the  nominal 
velocity  given  to  the  robotic  motion  controller — we  make 
the  assumption  that  there  are  no  systematic  errors.  The 
standard  deviations  used  are  pessimistic  estimates  of  the 
deviation  around  the  nominal  corresponding  velocity  mag¬ 
nitude.  The  accurate  determination  of  the  standard  devia¬ 
tions  does  not  significantly  influence  our  location  estimates 
given  frequent  enough  observations.  Under  this  assump¬ 
tion,  their  actual  value  is  not  critical  and  can  be  overesti¬ 
mated. 

If the magnitude of the velocity is normally distributed with mean
$v_0$ and standard deviation $\sigma_v$, i.e.
$v \sim N(v_0, \sigma_v^2)$ (assumed one-dimensional for the purpose
of this example), the location of an object with that velocity after
time $t$ is a random variable $x \sim N(v_0 t, \sigma_v^2 t^2)$,
also normally distributed, with mean $v_0 t$ and standard deviation
$\sigma_v t$. This suggests that the uncertainty of an object's
location increases linearly with time, as shown in Fig. 4.
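A short numerical sketch of this one-dimensional model follows. The function name is hypothetical; the values 10 and 1 are taken from the example in Fig. 4, and the time points are arbitrary.

```python
def position_distribution(v0, sigma_v, t):
    """Mean and standard deviation of x = v * t when v ~ N(v0, sigma_v**2)."""
    return v0 * t, sigma_v * abs(t)

# Spread of the position estimate at a few (arbitrary) elapsed times.
spread = {t: position_distribution(10.0, 1.0, t) for t in (1, 2, 4)}
```

The standard deviation doubles whenever the elapsed time doubles, which is why frequent observations are needed to keep the grid belief from blurring.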

4  EXPERIMENTAL  RESULTS 

We conducted our experiments on a B21 mobile robot equipped with a
single Sony XC-999 color camera with a 6 mm focal length lens,
mounted on a pan-tilt unit. Images of size 240 x 256 are acquired
through a Matrox Meteor framegrabber connected to the camera and are
used to train a decision tree in the manner described in section
2.1.

Figure 4: Probabilistic model of robot motion. Top image: Belief of
the location of the object deteriorates in time under uncertainty of
the magnitude of the velocity. Here $v \sim N(10, 1^2)$. Bottom
image: This graph illustrates the outcome of specific motion
commands projected along the z axis (a translation and a rotation).

We chose a simple histogram representation of down-sampled versions
of the training images as the input features to our decision tree
algorithm. In particular, we use color histograms for each tile, at
a resolution of 256 color bins. Each tile therefore yields an
example with 256 input features, namely the pixel percentages at
each color bin, and one binary-valued output, corresponding to the
event that the tile is part of the object being trained on.
Even  though  this  choice  of  input  features  does  not  take 
into  account  all  information  present  in  the  picture,  this  is 
simply  an  artifact  of  the  current  implementation  and  by 
no  means  imposes  any  restriction  on  the  choice  of  input 
features  of  the  approach  in  general.  More  complex  fea¬ 
tures  may  be  employed  in  future  implementations.  How¬ 
ever,  as  we  demonstrate  below,  this  simple  representation 
performs  adequately  well  in  certain  frequently  occurring 
situations  where  the  object  is  sufficiently  distinct  from  the 
background,  containing  enough  information  for  recovering 
the  approximate  location  of  simple  objects  in  3D.  The  “dis¬ 
tinctiveness”  is  determined  by  the  resolution  of  our  color 
histogram,  coupled  with  the  amount  of  hue  variation  that 
changes  in  light  intensity  on  the  object  result  in. 
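The feature vector for one tile can be sketched as a normalized histogram. How a pixel's color is quantized into one of the 256 bins is not specified in the text, so this sketch assumes each pixel has already been mapped to an integer bin index; the function name is hypothetical.

```python
def tile_histogram(pixel_bins, n_bins=256):
    """Fraction of a tile's pixels that fall in each color bin.

    pixel_bins: iterable of integer bin indices in [0, n_bins), one per
    pixel. The quantization that produces these indices is assumed.
    """
    counts = [0] * n_bins
    n = 0
    for b in pixel_bins:
        counts[b] += 1
        n += 1
    if n == 0:
        raise ValueError("empty tile")
    return [c / n for c in counts]

# An 8 x 8 tile (64 pixels) with two colors present.
features = tile_histogram([3] * 48 + [200] * 16)
```

Because the features are fractions rather than raw counts, they do not depend on the tile size.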
Figure 5: Probability map that is the output of the decision tree
trained to recognize the red chair. The brightest tiles in the
probability map (second column) correspond to probability greater
than 0.9. Projections of the map in 3D are shown in the last three
columns, as averages along the x, y and z (rightmost column) axes
respectively.


An  example  application  of  a  decision  tree  trained  on  three 
examples  with  an  object  (in  this  case,  a  bottle)  and  18 
background  images  (containing  negative  examples  only)  is 
shown  in  Fig.  2.  The  top  few  nodes  of  the  tree  are  shown  in 
Fig.  1.  In  a  similar  fashion  we  constructed  a  decision  tree 
to  recognize  a  larger  simple  object,  a  red  chair,  by  using 
the  same  all-negative  example  images  and  three  additional 
images  containing  the  chair  in  different  poses.  We  then 
manually  maneuvered  the  mobile  robot  around  the  chair 
taking  7  new  pictures  from  different  angles.  These  pictures 
are  shown  in  Fig.  5.  The  second  column  in  that  figure  de¬ 
picts  the  probability  map  that  is  output  from  the  decision 
tree for each image. At certain locations we acquired images and
projected the probability map in 3D, with each probability map
element corresponding to a pyramid, as described in section 3.1.
Every cell covered by a pyramid is affected by the corresponding
probability in the probability map. The results of projection when
viewed along the x, y and z axes are shown in the three rightmost
columns in Fig. 5. Each pixel in these projections has intensity
proportional to the average probability along the axis of projection
passing through that pixel. The z-axis projections make the
locations around the chair from which the pictures were taken
particularly easy to see.

In  reality,  the  robot  does  not  keep  a  3D  grid  for  each  im¬ 
age  but  rather  incorporates  information  incrementally  in  the 
single  grid  it  maintains,  which  is  justified  under  the  Markov 
assumption.  This  is  done  by  applying  Bayes  law  for  each 
cell individually. There is no normalization done over the
whole grid, which corresponds to the semantics we assign
to the probability stored at each cell: it represents the
probability that a part of the object occupies that cell. As such,
we  make  no  assumptions  about  the  size  of  the  object  with 
respect  to  the  cell  size. 
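The per-cell update can be sketched in a few lines using the odds-ratio form from section 3.1. The function and argument names are hypothetical, and the example probabilities are made up for illustration.

```python
def update_cell(prior_accumulated, p_tree, p_prior):
    """Per-cell Bayes update in odds form.

    prior_accumulated: Pr[xi | D(t-1)], the belief currently stored in the cell
    p_tree:            Pr[xi | i(t)], the decision-tree output for the tile
    p_prior:           Pr[xi], the initial prior for the cell
    All three must lie strictly between 0 and 1.
    """
    odds = ((p_tree / (1.0 - p_tree))
            * (prior_accumulated / (1.0 - prior_accumulated))
            * ((1.0 - p_prior) / p_prior))
    return odds / (1.0 + odds)

# Two consecutive images both report 0.8 for the tile covering this cell.
belief = 0.5
for p in (0.8, 0.8):
    belief = update_cell(belief, p, p_prior=0.5)
```

Each cell is updated on its own, so the beliefs over the grid need not sum to one, matching the per-cell semantics described above. A tree output equal to the prior leaves the stored belief unchanged.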

Between  images,  the  robot  is  maneuvered  manually  to  the 
spot  where  the  next  image  will  be  taken.  These  motions 
increase  our  uncertainty  in  the  manner  described  in  sec¬ 
tion  3.2.  The  robot  used  in  the  experiments  is  a  semi- 
holonomic  one,  its  motion  consisting  of  rotations  and  for¬ 
ward  or  backward  motions  in  the  direction  it  is  facing.  As 
such  we  model  rotational  and  translational  uncertainty  in 
the  magnitude  of  the  velocity. 

The  updating  of  the  grid  using  the  above  procedure  is 
shown  in  Fig.  6  for  one  run.  This  sequence  of  beliefs  cor¬ 
responds  to  a  situation  where  a  robot  faces  a  chair.  The 
grid  size  used  is  100  x  100  x  100  and  each  unit  along  any 
direction  corresponds  to  4cm  in  the  real  world.  All  beliefs 
shown  in  Fig.  6  are  projected  horizontally. 

As  can  be  seen  in  Fig.  6,  the  initial  location  of  the  target 
object(s)  is  unknown.  After  taking  a  first  image,  the  robot’s 
belief  is  a  conjunction  of  pyramids,  corresponding  to  the 
output  of  the  decision  tree.  As  the  robot  moves,  it  loses 
information.  As  it  takes  the  second  snapshot  from  a  dif¬ 
ferent  perspective,  the  belief  is  refined.  After  taking  seven 
images,  the  location  and  the  shape  of  the  target  object  are 
reconstructed  with  high  accuracy.  As  these  results  demon¬ 
strate,  our  approach  can  accurately  determine  the  location 
of  the  target  object.  It  is  also  robust  to  errors  in  the  robot’s 
odometry.  This  robustness  is  a  result  of  incorporating  our 
probabilistic  model  of  robot  motion. 

5  RELATED  WORK 

Decision trees [Qui86, Qui93, Mit97] are one of the most popular
inductive machine learning methods to date. The early algorithms
were only applicable to problems with discrete input and output
spaces. Decision tree learning algorithms in AI for real-valued
input spaces were proposed by [BFOS84], as a reinvention of earlier
work. Tree-based regression methods for real-valued input and output
spaces can also be found in [Fri91, Moo90]. The work presented in
this paper provides an example where a decision tree is used to
learn a conditional probability density function. Like the
approaches presented in [FI93, MKS94], it partitions a real-valued
high-dimensional input space into hypercubes. The output nodes,
however, represent conditional densities, which are estimated using
a frequentist approach [CB90]. This is related to results reported
in [TLS89, Mac92, Mit97], which show that under appropriate
assumptions, artificial neural networks approximate conditional
probability density functions.

The  mathematical  approach  for  integrating  information  is 
adopted  from  the  statistical  literature  [CB90,  Pea88].  The 
approach  presented  in  this  paper  also  bears  close  resem¬ 
blance  to  occupancy  grids  [Mor88,  Elf89].  Occupancy  grid 
approaches  are  popular  techniques  for  learning  models  of 
mobile  robot  environments  from  sensor  data.  Just  like  the 
approach  proposed  here,  they  represent  the  environment  us¬ 
ing  fine-grained,  evenly  spaced  grids.  Each  grid  point  is  an¬ 
notated  by  a  probability,  which  describes  the  evidence  that 
a  location  contains  an  object/obstacle.  The  vast  majority 
of existing approaches differ from the one proposed here
in three aspects. First, they model occupancy, not the location
of a specific target object. Second, they are usually
constructed  from  range  measurements  (e.g.,  sonar,  laser), 
not  from  camera  images.  Third,  they  are  usually  two- 
dimensional.  There  are,  however,  notable  exceptions.  The 
approaches  described  in  [MM94,  TBB+98]  construct  oc¬ 
cupancy  grids  from  sequences  of  camera  images.  Moravec 
and Martin’s approach [MM94] was probably the first
to  construct  3D  grids,  instead  of  the  commonly  used  2D 
representations.  Both  approaches,  however,  used  stereo 
vision  to  estimate  the  location  of  obstacles.  Stereo  vi¬ 
sion  generates  distance  estimates,  which  greatly  facilitates 
the  construction  of  the  maps.  The  approach  reported  here 
estimates  distance  indirectly,  through  integrating  multiple 
camera  images  recorded  at  different  locations.  Unfortu¬ 
nately,  the  approach  in  [MM94]  is  incapable  of  dealing 
with  error  in  the  robot’s  odometry. 

Figure 6: Cumulative effects on motion and probability map projection on grid as viewed along the x-axis (that is, running
perpendicular from the door facing the interior of the room in the pictures in Fig. 5). The two distinct parts of the chair
(back and seat) are discernible.

Object-centered 3D object reconstruction has also been investigated
in the context of computer vision. Two approaches
have emerged. One models objects as 3D surfaces,
typically represented as polygonal meshes. For example,
[FL95] uses stereo and intensity matching to construct
and fit the mesh. The second approach uses a grid
representation essentially similar to the one used in this paper
(e.g., [Col96]), and employs a technique sometimes referred
to as “3D voting” to update cell “occupancies.” This
differs from our approach in two ways. First, cells are updated
by counting votes in a straightforward if ad hoc manner,
which employs techniques such as voting for cells within a
radius of the intersection with the line through the eyepoint
and the line segment. This is necessitated partly by the
inability  to  model  inaccuracies  in  the  viewpoint  location, 
although  in  many  such  applications — for  example,  military 
aerial  photography — the  camera  location  is  estimated  rel¬ 
atively  accurately.  Second,  these  techniques  do  not  learn 
a  probabilistic  model  of  the  set  of  features  that  are  em¬ 
ployed  from  examples.  As  such,  all  features  are  equally 
weighted,  necessitating  the  use  of  a  threshold — in  order  to 
produce  a  recognizable  picture — the  selection  of  which  can 
be  difficult  (although  see  [Col96]  for  a  statistical  approach 
to  threshold  estimation). 

Our  approach  is  similar  to  Markov  localization  [BFHS96, 
NPB95,  SK95,  Thr98a],  a  method  for  probabilistically  es¬ 
timating  the  pose  of  a  mobile  robot  in  a  (known)  environ¬ 
ment.  Markov  localization  relies  on  the  same  statistical 
principles  for  integrating  multiple  sensor  readings  into  a 
single belief. In fact, the approach in [BFHS96] uses the
same basic representation as our approach: evenly spaced
grids. Markov localization, however, rests on the assumption
that there is exactly one object (i.e., the robot) whose
location  is  to  be  estimated.  Our  approach  can  handle  situ¬ 
ations  that  contain  a  variable  (unknown)  number  of  target 
objects. 

Finally,  the  problem  of  finding  and  manipulating  objects 
has  received  considerable  attention  within  the  AI  commu¬ 
nity  (see  [Hor94]  and  various  papers  in  [Sim95,  KBM98]). 
For example, Buhmann et al. [BBC+95] described an approach
where a robot could be trained to recognize specific
objects. Most existing approaches in the mobile robot community,
however, assume that the object is located
at floor height, in which case camera coordinates can
be converted directly to real-world coordinates. Our approach
is specifically designed to find objects at arbitrary
locations in space. This is important in many real-world
applications, as objects are frequently found on tables,
chairs, etc.

6  DISCUSSION  AND  FUTURE 
RESEARCH 

This  paper  presented  a  novel  approach  to  estimating  the 
3D  location  of  an  object  with  a  mobile  robot.  Individ¬ 
ual  camera  images  are  interpreted  using  a  decision  tree 
method,  which  maps  image  regions  (tiles)  into  probabilis¬ 
tic  estimates  for  the  presence  of  target  objects.  Based  on 
a straightforward geometric consideration, these probabilities
are mapped into 3D pyramids in global world coordinates.
Multiple pyramids, obtained from camera images
recorded from different viewpoints, are integrated using
Bayes rule into a single probabilistic model of the object
location. Noise in robot motion is accounted for by a
probabilistic  model  of  robot  motion.  Experimental  results 
demonstrate  that  the  method  can  robustly  localize  objects 
in  3D  space. 
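
The integration step summarized above can be illustrated with a small sketch. It is not the implementation used in the experiments: it assumes that grid cells are independent, that views are conditionally independent, that every probability lies strictly between 0 and 1, and that cells outside a view simply receive the prior. The names `fuse` and `update_grid` are hypothetical.

```python
def fuse(belief, p_view, prior=0.5):
    """Bayes update of one cell in odds form.

    belief: current P(cell contains target); p_view: probability assigned
    to the cell by the newest image. Both must lie strictly in (0, 1).
    """
    odds = (belief / (1.0 - belief)) * (p_view / (1.0 - p_view)) * ((1.0 - prior) / prior)
    return odds / (1.0 + odds)


def update_grid(grid, view):
    """Fuse one view into the grid; cells the view does not cover keep their belief."""
    return {cell: fuse(p, view.get(cell, 0.5)) for cell, p in grid.items()}


# A tiny 2 x 2 x 2 grid with an uninformative initial belief.
grid = {(x, y, z): 0.5 for x in range(2) for y in range(2) for z in range(2)}
# Hypothetical per-cell evidence from one image: one likely cell, one unlikely cell.
view = {(1, 1, 1): 0.9, (0, 0, 0): 0.1}
for _ in range(3):  # three consistent views
    grid = update_grid(grid, view)
```

Repeated consistent evidence drives a cell's belief toward 0 or 1, while unobserved cells stay at the prior. A motion model would additionally blur the belief between views, which this sketch omits.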

A  key  advantage  of  the  current  approach  is  its  generality. 
No  assumption  is  made  concerning  the  typical  location  of 
objects  (e.g.,  they  are  not  assumed  to  lie  on  the  floor).  The 
approach  can  also  be  trained  easily  to  recognize  new,  user- 
specified  objects.  While  our  current  implementation  uses 
color  as  the  primary  cue  for  object  recognition,  the  method 
can  equally  be  applied  to  a  much  richer  range  of  image 
features,  making  it  fit  for  a  large  class  of  target  objects  (i.e., 
objects  that  can  be  recognized  from  local  image  features). 

Our  approach  rests  on  several  limiting  assumptions.  First 
of all, it assumes that the object does not move. To accommodate
moving objects, our approach would have to be extended
by a probabilistic model of object motion. Such a
model  might  characterize  the  typical  motion  speed  of  the 
target  object.  It  is  unclear,  however,  if  such  an  approach 
would  be  able  to  gather  sufficient  information  to  estimate 
the  location  of  a  moving  object  with  the  necessary  accu¬ 
racy. 

Our  approach  also  assumes  that  the  training  images  ac¬ 
curately  represent  the  situation  during  testing.  In  our  ex¬ 
periments,  we  usually  enriched  the  training  set  by  a  small 
number  of  pictures  recorded  at  random  locations  in  our 
lab. These pictures were used as negative training examples
when growing the tree. We found that these additional
images increased the robustness of the image analysis,
thereby improving the overall estimation results. However,
the method might fail if the robot encounters an object
that is similar to the target object but was not
part of its training set.

The  spatial  resolution  in  the  experiments  described  in  this 
paper is low, due to the enormous complexity involved in updating
3D grids. With a 4 cm resolution, the computational
overhead was manageable. Denser and larger grids
are desirable, but unfortunately the computational cost
of updating the grid is cubic in the number of grid cells.
An  interesting  extension  of  the  current  approach  would  be 
to  use  variable-resolution  representations,  such  as  oct-trees 
[Sam89b,  Sam89a,  Moo90],  for  representing  object  loca¬ 
tion.  Such  representations  could  balance  the  computational 
and  memory  resources,  by  modeling  regions  coarsely  that 
are unlikely to contain a target object. If the density of target
objects is low (which is usually the case), such an extension
could improve the computational efficiency of the
approach substantially.

Another  promising  extension  of  the  current  approach 
would  be  to  devise  methods  that  actively  control  the  robot 
so  as  to  maximize  information  gain.  In  the  experiments 
presented  here,  a  human  manually  positioned  the  robot.  In 
our  previous  work  [Thr98b],  however,  we  already  devel¬ 
oped  successful  methods  for  active  information  gathering, 
which  were  applied  in  the  context  of  learning  2D  occu¬ 
pancy  grid  maps.  In  the  context  of  object  localization,  such 
methods  could  lead  to  a  behavior  where  a  robot  investigates 
the  object  from  multiple  viewpoints,  in  order  to  estimate  its 
location  accurately.  The  development  of  such  methods  is 
subject  to  future  research. 

Multiple-Instance Learning for Natural Scene Classification

Abstract

Multiple-Instance  learning  is  a  way  of  mod¬ 
eling  ambiguity  in  supervised  learning  exam¬ 
ples.  Each  example  is  a  bag  of  instances,  but 
only  the  bag  is  labeled  -  not  the  individual 
instances.  A  bag  is  labeled  negative  if  all  the 
instances  are  negative,  and  positive  if  at  least 
one of the instances is positive. We apply
the  Multiple-Instance  learning  framework  to 
the  problem  of  learning  how  to  classify  nat¬ 
ural  images.  Images  are  inherently  ambigu¬ 
ous  since  they  can  represent  many  different 
things.  A  user  labels  an  image  as  positive 
if  the  image  somehow  contains  the  concept. 
Each  image  is  a  bag,  and  the  instances  are 
various  sub-regions  in  the  image.  From  a 
small  collection  of  positive  and  negative  ex¬ 
amples,  we  can  learn  the  concept  and  then 
use  it  to  retrieve  images  that  contain  the  con¬ 
cept  from  a  large  database.  We  show  that 
the  Diverse  Density  algorithm  performs  well 
in  this  task,  that  simple  hypothesis  classes 
are  sufficient  to  classify  natural  images,  and 
that  user  interaction  helps  to  improve  perfor¬ 
mance. 


1  INTRODUCTION 

Scene  classification  is  an  open  problem  in  machine  vi¬ 
sion  and  has  applications  in  image  and  video  database 
indexing.  We  investigate  a  method  for  learning  visual 
concepts  that  encode  the  properties  of  a  scene  class 
from  a  small  set  of  positive  and  negative  examples. 
Extracted  concepts  are  simple  templates  that  capture 
some  color  and  spatial  properties  of  the  class.  Work 
by Lipson [Lipson et al., 1997] illustrates that simple,
hand-crafted templates that describe the relative
color  and  spatial  properties  in  an  image  can  be  used 
successfully  to  classify  natural  scenes  like  fields,  snowy 
mountains  and  waterfalls.  In  this  paper  we  show  that 
these  templates  can  be  learned.  We  describe  a  frame¬ 
work  for  learning  scene-class  concepts  that  can  be  used 
effectively  for  the  task  of  content-based  image  retrieval 
from  large  databases.  The  learning  framework  we  use 
in this paper is called Multiple-Instance learning [Dietterich
et al., 1997, Maron and Lozano-Perez, 1998].
In this framework, the training examples are not individually
labeled instances but labeled bags. Each bag is a collection of
instances (Figure 1). A bag is labeled negative if all the
instances  in  it  are  negative,  and  positive  if  at  least  one 
of  the  instances  in  it  is  positive.  We  use  this  framework 
to  model  the  ambiguity  in  mapping  an  image  to  many 
possible  templates  which  describe  the  image.  Specifi¬ 
cally,  every  image  is  a  bag,  and  each  possible  template 
for  describing  the  image  is  one  instance  in  the  bag. 
We  discuss  a  method  called  Diverse  Density  [Maron 
and  Lozano-Perez,  1998]  for  learning  concepts  from 
Multiple-Instance examples.

We  test  our  approach  on  images  from  the  COREL 
photo library. We show that the system is successful
even  when  the  hypothesis  class  involves  very  simple 
templates,  and  even  when  the  images  are  sampled  very 
coarsely.  In  addition,  we  show  that  user  interaction 
(refining  the  hypothesis  through  the  addition  of  more 
examples)  is  helpful  in  improving  the  performance  of 
the  learning  system.  In  Section  2,  we  discuss  previous 
and related work in image classification. In Section 3, we
describe the Multiple-Instance learning framework and
the Diverse Density algorithm. In Section 4, we detail
our experimental setup and show results on various
concept classes, hypothesis classes, and training
regimes.

The  third  contribution  of  this  paper  (in  addition  to 
a novel application of Multiple-Instance learning and
the  discovery  that  surprisingly  simple  concepts  do  well 
on  this  task)  is  the  development  of  a  general  architec¬ 
ture  to  combine  ideas  from  the  vision  and  machine 
learning  communities.  A  key  part  of  our  system  is  the 
bag  generator:  a  mechanism  which  takes  an  image  and 
generates  a  set  of  instances,  where  each  instance  is  a 
possible  description  of  what  the  image  is  about.  If  an 
idealized  object  recognizer  existed,  then  the  bag  gen¬ 
erator  would  simply  output  a  list  of  the  objects  in  the 
image.  The  learning  algorithm  would  be  straightfor¬ 
ward:  find  an  intersection  between  the  positive  lists 
that  didn’t  include  elements  from  the  negative  lists. 
On  the  other  extreme,  if  we  had  a  learning  algorithm 
that  could  handle  billions  of  instances  per  bag,  then 
we  would  not  need  an  object  recognizer.  Instead,  the 
bag generator would simply output every subcombination
of pixels in the image. In this paper, we use
a  slightly  more  sophisticated  bag  generator  (one  that 
generates  subregions),  which  limits  the  number  of  in¬ 
stances  per  bag  and  therefore  allows  us  to  use  an  algo¬ 
rithm  such  as  Diverse  Density.  The  key  observation  is 
that  a  better  bag  generator  (progress  in  the  vision  com¬ 
munity)  leads  to  a  simpler  learning  algorithm,  while 
at  the  same  time  a  better  Multiple-Instance  learning 
algorithm  (progress  in  the  machine  learning  commu¬ 
nity)  allows  us  to  use  simpler  segmentation  algorithms. 
This is in contrast with the architecture of [Keeler et
al., 1991], for example, where the learning mechanism
is  woven  into  the  position-invariant  representation  of 
subimages. 

2  IMAGE  CLASSIFICATION 
SYSTEMS 

In  the  past  few  years,  the  growing  number  of  digital 
image  and  video  libraries  has  led  to  the  need  for  flexi¬ 
ble,  automated  content-based  image  retrieval  systems 
which  can  efficiently  retrieve  images  from  a  database 
that  are  similar  to  a  user’s  query.  Because  what  a  user 
wants  can  vary  greatly,  we  also  want  to  provide  a  way 
for  the  user  to  explore  and  refine  the  query  by  letting 
the  system  bring  up  examples. 

One  of  the  most  popular  global  techniques  for  index¬ 
ing  is  color-histogramming  which  measures  the  over¬ 
all  distribution  of  colors  in  the  image.  While  his¬ 
tograms  are  useful  because  they  are  relatively  insensi¬ 
tive  to  position  and  orientation  changes,  they  do  not 
capture  the  spatial  relationships  of  color  regions  and 
thus  have  limited  discriminating  power.  Many  of  the 
existing image-querying systems work on entire images
or on user-specified regions by using the distribution
of color, texture and structural properties. The QBIC
system [Flickner et al., 1995] is an example of such a
system. Some recent systems that try to incorporate
some spatial information into their color feature sets
include [Smith and Chang, 1996, Huang et al., 1997,
Belongie et al., 1998]. Promising work by Rubner
[Rubner et al., 1998] on the earth mover’s distance
provides  a  metric  that  overcomes  the  binning  problems 
of  existing  definitions  of  distribution  distances  for  in¬ 
dexing.  Most  of  these  techniques  require  the  user  to 
specify  the  salient  regions  in  the  query  image.  One  of 
the  goals  of  our  system  is  to  learn  the  relevant  color 
and  spatial  properties  that  best  describe  a  particular 
class  of  natural  scenes. 
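
The limitation of global color histograms noted above can be made concrete with a toy example. The sketch below is illustrative only: the image representation (a list of rows of RGB tuples) and the function name `color_histogram` are assumptions, not part of any system cited here.

```python
from collections import Counter


def color_histogram(image):
    """Map each RGB color to the fraction of pixels that have it."""
    counts = Counter(pixel for row in image for pixel in row)
    total = sum(counts.values())
    return {color: n / total for color, n in counts.items()}


BLUE, WHITE = (0, 0, 255), (255, 255, 255)
# "Sky above snow" and its vertical flip, "snow above sky".
sky_over_snow = [[BLUE] * 4, [BLUE] * 4, [WHITE] * 4, [WHITE] * 4]
snow_over_sky = sky_over_snow[::-1]

same_histogram = color_histogram(sky_over_snow) == color_histogram(snow_over_sky)
same_image = sky_over_snow == snow_over_sky
```

The two images differ in spatial layout, yet their histograms are identical, which is why a histogram alone cannot express a relation such as "blue above white".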

More  recently,  work  by  Lipson  and  Sinha  ([Lipson  et 
al.,  1997])  in  scene  classification  illustrates  that  pre¬ 
defined  flexible  templates  that  describe  the  relative 
color  and  spatial  properties  in  the  image  can  be  used 
effectively for this task. The flexible templates constructed
by Lipson [Lipson et al., 1997] encode the
scene  classes  as  a  set  of  image  patches  and  qualita¬ 
tive  relationships  between  those  patches.  Each  im¬ 
age  patch  has  properties  in  the  color  and  luminance 
channels.  These  templates  describe  the  color  relation¬ 
ship  (relative  changes  in  the  R,G,B  channels),  lumi¬ 
nance  relationship  (relative  changes  in  the  luminance 
channel)  and  spatial  relationship  between  two  image 
patches.  Lipson  hand-crafted  these  flexible  templates 
for  a  variety  of  scene  classes  and  showed  that  they 
could  be  used  to  classify  natural  scenes  of  fields,  wa¬ 
terfalls  and  snowy  mountains  efficiently  and  reliably. 
For  example,  the  following  concept  might  be  learned 
for  the  snowy-mountain  class:  “if  the  image  contains  a 
blue  blob  which  is  above  a  white  blob  which  is  above  a 
brown blob, then it is a mountain.” In this paper, we
would  like  to  learn  such  concepts  for  natural  images 
given  a  small  set  of  positive  and  negative  examples. 

All  of  the  systems  described  above  require  users  to 
specify precisely what they want. Minka and Picard
[Minka and Picard, 1996] introduced a learning
component in their system by using positive and
negative  examples  which  let  the  system  choose  image 
groupings  within  and  across  images  based  on  color  and 
texture  cues;  however,  their  system  requires  the  user 
to label various parts of the scene, whereas our system
only  gets  a  label  for  the  entire  image  and  automatically 
extracts  the  relevant  parts  of  the  scene.  In  this  paper, 
we  focus  on  learning  natural  scene  concepts  by  extract¬ 
ing  color  and  spatial  relations  between  image  patches 
using  a  small  set  of  positive  and  negative  examples. 
Our system uses a small set of user-selected positive
and  negative  examples  to  learn  a  scene  concept  which 
is  used  to  retrieve  similar  images  from  the  database. 
The  system  also  lets  the  user  add  more  positive  and 
negative  examples  after  each  iteration  in  order  to  re¬ 
fine  the  concept. 

3  MULTIPLE-INSTANCE 
LEARNING 

In  traditional  supervised  learning,  a  learning  algorithm 
receives  a  training  set  which  consists  of  individually  la¬ 
beled  examples.  There  are  situations  where  this  model 
fails,  specifically,  when  the  teacher  cannot  label  indi¬ 
vidual  instances,  but  only  a  collection  of  instances.  For 
example,  given  a  picture  containing  a  waterfall,  what 
is  it  about  the  image  that  causes  it  to  be  labeled  as 
a  waterfall?  Is  it  the  butterfly  hovering  in  the  corner, 
the  blooming  flowers,  or  the  white  stream  of  water? 
It  is  impossible  to  tell  by  looking  at  only  one  image. 
The  best  we  can  say  is  that  at  least  one  of  the  ob¬ 
jects  in  the  image  is  a  waterfall.  Given  a  number  of 
images (each labeled as waterfall or non-waterfall), we
can  attempt  to  find  commonalities  within  the  waterfall 
images  that  do  not  appear  in  the  non-waterfall  images. 
Multiple-Instance  learning  is  a  way  of  formalizing  this 
problem,  and  Diverse  Density  is  a  method  for  finding 
the  commonality. 

In Multiple-Instance learning, we receive a set of bags,
each  of  which  is  labeled  positive  or  negative.  Each 
bag  contains  many  instances,  where  each  instance  is  a 
point  in  feature  space.  A  bag  is  labeled  negative  if  all 
the  instances  in  it  are  negative.  On  the  other  hand,  a 
bag  is  labeled  positive  if  there  is  at  least  one  instance 
in it which is positive. From a collection of labeled
bags,  the  learner  tries  to  induce  a  concept  that  will 
label  unseen  bags  correctly.  This  problem  is  harder 
than  even  noisy  supervised  learning  because  the  ratio 
of  negative  to  positive  instances  in  a  positively-labeled 
bag  (the  noise  ratio)  can  be  arbitrarily  high. 
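
The labeling rule and the noise ratio can be stated in a few lines. This is a minimal sketch of the assumed generative rule only: the learner never sees the per-instance labels used here, and the names `bag_label` and `noise_ratio` are hypothetical.

```python
def bag_label(instance_labels):
    """A bag is positive iff at least one of its instances is positive."""
    return any(instance_labels)


def noise_ratio(instance_labels):
    """Ratio of negative to positive instances; only defined for positive bags."""
    positives = sum(1 for label in instance_labels if label)
    negatives = len(instance_labels) - positives
    return negatives / positives


# Hidden instance labels for two hypothetical images.
waterfall_image = [False] * 99 + [True]   # one relevant subimage out of 100
field_image = [False] * 100               # no relevant subimage
```

Here `waterfall_image` is a positive bag with a noise ratio of 99, which illustrates why a single positive bag says very little about which instance carries the concept.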

The Multiple-Instance learning model was only recently
formalized by [Dietterich et al., 1997], where
they develop algorithms for the drug activity prediction
problem. This work was followed by [Long and
Tan, 1996, Auer et al., 1996, Blum and Kalai, 1998],
who  showed  that  it  is  difficult  to  PAC-learn  in  the 
Multiple-Instance  model  unless  very  restrictive  inde¬ 
pendence  assumptions  are  made  about  the  way  in 
which  examples  are  generated.  [Auer,  1997]  shows 
that despite these assumptions, the MULTINST algorithm
performs competitively on the drug activity
prediction problem. [Maron and Lozano-Perez, 1998]
develop  an  algorithm  called  Diverse  Density,  and  show 
that  it  performs  well  on  a  variety  of  problems  such  as 
drug  activity  prediction,  stock  selection,  and  learning 
a  description  of  a  person  from  a  series  of  images  that 
contain  that  person. 

3.1  MULTIPLE-INSTANCE  LEARNING 
FOR  SCENE  CLASSIFICATION 

In  this  paper,  each  training  image  is  a  bag.  The  in¬ 
stances  in  a  particular  bag  are  various  subimages.  If 
the  bag  is  labeled  as  a  waterfall  (for  example),  we  know 
that at least one of the subimages (instances) is a waterfall.
If the bag is labeled as a non-waterfall, we
know  that  none  of  the  subimages  contains  a  waterfall. 
Each  of  the  instances,  or  subimages,  is  described  as  a 
point  in  some  feature  space.  As  discussed  in  section  4, 
we  experimented  with  several  ways  of  describing  an 
instance.  We  will  discuss  one  of  them  (single  blob 
with  neighbors)  in  detail:  a  subimage  is  a  2x2  set 
of  pixels  (referred  to  as  a  blob)  and  its  four  neighbor¬ 
ing  blobs  (up,  down,  left,  and  right).  The  subimage  is 
described as a vector $[x_1, x_2, \ldots, x_{15}]$, where $x_1, x_2, x_3$
are the mean RGB values of the central blob, $x_4, x_5, x_6$
are  the  differences  in  mean  RGB  values  between  the 
central  blob  and  the  blob  above  it,  etc.  One  bag  is 
therefore  a  collection  of  instances,  each  of  which  is  a 
point  in  a  15-dimensional  feature  space.  We  assume 
that  at  least  one  of  these  instances  is  the  template 
that  contains  the  waterfall. 
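
A bag generator for the single-blob-with-neighbors description might look like the sketch below. Several details are assumptions made for illustration: blobs tile the image on a non-overlapping 2x2 grid, border blobs that lack a neighbor are skipped, differences are taken as center minus neighbor, and neighbors are ordered up, down, left, right. The function names are hypothetical.

```python
def mean_rgb(img, top, left):
    """Mean RGB of the 2x2 blob whose top-left pixel is (top, left)."""
    pixels = [img[top + dr][left + dc] for dr in (0, 1) for dc in (0, 1)]
    return [sum(p[ch] for p in pixels) / 4.0 for ch in range(3)]


def blob_with_neighbors_bag(img):
    """Turn an image (rows of RGB tuples) into a bag of 15-dimensional instances."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    blobs = [[mean_rgb(img, 2 * r, 2 * c) for c in range(cols)] for r in range(rows)]
    bag = []
    for r in range(1, rows - 1):          # skip border blobs (assumption)
        for c in range(1, cols - 1):
            center = blobs[r][c]
            instance = list(center)       # x1..x3: mean RGB of the central blob
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                # x4..x15: RGB differences to the up, down, left, right blobs
                instance += [center[ch] - blobs[nr][nc][ch] for ch in range(3)]
            bag.append(instance)
    return bag


BLUE, WHITE = (0, 0, 255), (255, 255, 255)
img = [[BLUE] * 8 for _ in range(4)] + [[WHITE] * 8 for _ in range(4)]
bag = blob_with_neighbors_bag(img)
```

For this 8x8 image of blue above white, the bag holds four instances, and the instances on the boundary encode the color change between the central blob and its vertical neighbor.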

We  would  now  like  to  find  a  description  which  will 
correctly  classify  new  images  as  waterfalls  or  non¬ 
waterfalls.  This  can  be  done  by  finding  what  is  in 
common  between  the  waterfall  images  given  during 
training  and  the  differences  between  those  and  the 
non-waterfall images. The main idea behind the Diverse
Density (DD) algorithm is to find areas in feature
space  that  are  close  to  at  least  one  instance  from  ev¬ 
ery  positive  bag  and  far  from  every  negative  instance. 
The  algorithm  searches  the  feature  space  for  points 
with  high  Diverse  Density.  Once  the  point  (or  points) 
with  maximum  DD  is  found,  a  new  image  is  classified 
positive  if  one  of  its  subimages  is  close  to  the  maximum 
DD  point.  As  seen  in  Section  4,  the  entire  database 
can  be  sorted  by  the  distance  to  the  learned  concept. 
Figure  1  is  a  schematic  of  how  the  system  works. 
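
Retrieval with a learned concept can be sketched as follows, assuming (for illustration) that an image's distance to the concept is the squared Euclidean distance of its closest instance. The names `bag_distance` and `rank_images`, and the toy database, are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bag_distance(bag, concept):
    """Distance from an image (bag) to the concept: that of its closest instance."""
    return min(sq_dist(instance, concept) for instance in bag)


def rank_images(bags, concept):
    """Return image names sorted from closest to farthest from the concept."""
    return sorted(bags, key=lambda name: bag_distance(bags[name], concept))


# Toy database: image name -> list of instances in a 2D feature space.
database = {
    "a": [[0.0, 0.0], [5.0, 5.0]],
    "b": [[1.0, 1.0]],
    "c": [[9.0, 9.0], [3.0, 3.0]],
}
ranking = rank_images(database, [1.0, 1.0])
```

Classifying a new image as positive then amounts to thresholding `bag_distance`; the threshold itself would have to be chosen separately.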

In  the  following  subsection,  we  will  describe  a  deriva¬ 
tion  of  Diverse  Density  and  how  we  find  the  maximum 
in  a  large  feature  space.  We  will  also  show  that  the 
appropriate  scaling  of  the  feature  space  can  be  found 
by  maximizing  DD  not  just  with  respect  to  location  in 
feature space, but also with respect to a weighting of
each of the features.

Figure 1: System diagram. (Panel titles: examples of other hypothesis classes: row; blob with no neighbors; blob with neighbors; 2 blobs with no neighbors; 2 blobs with neighbors.)

3.2  DIVERSE  DENSITY 

In  this  section,  we  derive  a  probabilistic  measure  of 
Diverse Density. More details are given in [Maron,
1998]. We denote positive bags as $B_i^+$, and the $j$th
instance in that bag as $B_{ij}^+$. Likewise, $B_{ij}^-$ represents
an instance from a negative bag. For simplicity,
let us assume that the true concept is a single
point $t$ in feature space. We can find $t$ by maximizing
$\Pr(t \mid B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^-)$ over all points in feature
space. Using Bayes’ rule and a uniform prior over
the  concept  location,  we  see  that  this  is  equivalent  to 
maximizing  the  likelihood: 

$$\arg\max_t \Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid t). \tag{1}$$

By  making  the  additional  assumption  that  the  bags  are 
conditionally  independent  given  the  target  concept  t, 
this  decomposes  into 

$$\arg\max_t \prod_i \Pr(B_i^+ \mid t) \prod_i \Pr(B_i^- \mid t) \tag{2}$$

which  is  equivalent  (by  similar  arguments  as  above)  to 
maximizing 

$$\arg\max_t \prod_i \Pr(t \mid B_i^+) \prod_i \Pr(t \mid B_i^-). \tag{3}$$
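
The step from (2) to (3) can be spelled out for a single bag $B_i$ (positive or negative). The sketch assumes, as the text does, a uniform prior over $t$, and additionally treats $\Pr(B_i)$ as a quantity that does not depend on $t$:

```latex
\Pr(B_i \mid t) \;=\; \frac{\Pr(t \mid B_i)\,\Pr(B_i)}{\Pr(t)}
\quad\Longrightarrow\quad
\arg\max_t \prod_i \Pr(B_i \mid t)
\;=\; \arg\max_t \prod_i \Pr(t \mid B_i).
```

Because $\Pr(t)$ is constant under the uniform prior and $\Pr(B_i)$ does not vary with $t$, both factors can be dropped without changing the maximizer.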

This  is  a  general  definition  of  Diverse  Density,  but  we 
need to define the terms in the products to instantiate
it. In this paper, we use the noisy-or model as follows:

$$\Pr(t \mid B_i^+) = 1 - \prod_j \bigl(1 - \Pr(t \mid B_{ij}^+)\bigr). \tag{4}$$

The noisy-or model makes two assumptions. One is
that if $t$ is the target concept, it is caused by
(hence close to) one of the instances in the bag. It
also  assumes  that  the  probability  of  instance  j  not  be¬ 
ing  the  target  is  independent  of  any  other  instance  not 
being  the  target. 

Finally, we estimate the distribution $\Pr(t \mid B_{ij}^+)$ with
a Gaussian-like distribution of $\exp(-\|B_{ij}^+ - t\|^2)$.
A negative bag’s contribution is likewise computed as
$\Pr(t \mid B_i^-) = \prod_j \bigl(1 - \Pr(t \mid B_{ij}^-)\bigr)$. A supervised learning
algorithm such as nearest-neighbor or kernel regression
would average the contribution of each bag,
computing  a  density  of  instances.  This  algorithm  com¬ 
putes  a  product  of  the  contribution  of  each  bag,  hence 
the  name  Diverse  Density.  Note  that  Diverse  Density 
at  an  intersection  of  n  bags  is  exponentially  higher 
than  it  is  at  an  intersection  of  n  -  1  bags,  yet  all  it 
takes  is  one  well  placed  negative  instance  to  drive  the 
Diverse  Density  down. 
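
The noisy-or form of Diverse Density described in this section can be evaluated directly. The following is a minimal sketch under the stated model (unweighted squared distance, one product term per bag); the function names and the toy data are assumptions for illustration.

```python
import math


def instance_prob(instance, t):
    """Pr(t | B_ij) = exp(-||B_ij - t||^2)."""
    return math.exp(-sum((b - x) ** 2 for b, x in zip(instance, t)))


def diverse_density(t, positive_bags, negative_bags):
    """Product of noisy-or terms for positive bags and 'all miss' terms for negative bags."""
    dd = 1.0
    for bag in positive_bags:
        miss = 1.0
        for instance in bag:
            miss *= 1.0 - instance_prob(instance, t)
        dd *= 1.0 - miss          # at least one instance is close to t
    for bag in negative_bags:
        for instance in bag:
            dd *= 1.0 - instance_prob(instance, t)   # every instance is far from t
    return dd


positive_bags = [[[1.0, 1.0], [5.0, 5.0]], [[1.1, 0.9], [-3.0, 2.0]]]
negative_bags = [[[5.0, 5.1]]]
dd_near = diverse_density([1.0, 1.0], positive_bags, negative_bags)
dd_far = diverse_density([5.0, 5.0], positive_bags, negative_bags)
```

The point (1, 1) lies close to one instance from each positive bag and far from the negative instance, so its Diverse Density is high; the point (5, 5) is close to only one positive bag and sits next to a negative instance, so its value collapses.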

The  initial  feature  space  is  probably  not  the  most 
suitable  one  for  finding  commonalities  among  images. 
Some  features  might  be  irrelevant  or  redundant,  while 
small  differences  along  other  features  might  be  crucial 
for  discriminating  between  positive  and  negative  ex¬ 
amples.  The  Diverse  Density  framework  allows  us  to 
find  the  best  weighting  on  the  initial  feature  set  in  the 
same way that it allows us to find an appropriate location
in feature space. If a feature is irrelevant, then
removing  it  can  only  increase  the  DD  since  it  will  bring 
positive  instances  closer  together.  On  the  other  hand, 
if  a  relevant  feature  is  removed  then  negative  instances 
will  come  closer  to  the  best  DD  location  and  lower  it. 
Therefore,  a  feature’s  weight  should  be  changed  in  or¬ 
der  to  increase  DD.  Formally,  the  distance  between 
two points in feature space ($B_{ij}$ and $t$) is

$$\|B_{ij} - t\|^2 = \sum_k w_k (B_{ijk} - t_k)^2 \tag{5}$$

where $B_{ijk}$ is the value of the $k$th feature in the $j$th
point in the $i$th bag, and $w_k$ is a non-negative scaling
factor. If $w_k$ is zero, then the $k$th feature is irrelevant.
If $w_k$ is large, then the feature is very important.
We would like to find both $t$ and $w$ such that Diverse
Density is maximized. We have doubled the number
of dimensions in our search space, but we now have
a powerful method of changing our representation to
accommodate the task.
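
The effect of the feature weights can be seen on a toy problem. In this sketch, which uses hypothetical names and data, the second feature is irrelevant: the positive instances agree on the first feature but scatter along the second.

```python
import math


def weighted_dd(t, w, positive_bags, negative_bags):
    """Diverse Density with the scaled distance sum_k w_k (B_ijk - t_k)^2."""
    def prob(instance):
        dist = sum(wk * (b - x) ** 2 for wk, b, x in zip(w, instance, t))
        return math.exp(-dist)

    dd = 1.0
    for bag in positive_bags:
        miss = 1.0
        for instance in bag:
            miss *= 1.0 - prob(instance)
        dd *= 1.0 - miss
    for bag in negative_bags:
        for instance in bag:
            dd *= 1.0 - prob(instance)
    return dd


positive_bags = [[[1.0, 0.0]], [[1.0, 2.0]]]   # agree on feature 0 only
negative_bags = [[[4.0, 1.0]]]                 # differs on feature 0
t = [1.0, 1.0]
dd_equal = weighted_dd(t, [1.0, 1.0], positive_bags, negative_bags)
dd_drop = weighted_dd(t, [1.0, 0.0], positive_bags, negative_bags)
```

Setting the weight of the second feature to zero pulls the positive instances together without bringing the negative instance any closer, so the Diverse Density at `t` increases.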

We can also use this technique to learn more complicated
concepts than a single point. To learn a 2-disjunct
concept $t \vee s$, we maximize Diverse Density as
follows:

$$\arg\max_{t,s} \prod_i \Bigl(1 - \prod_j \bigl(1 - \Pr(t \vee s \mid B_{ij}^+)\bigr)\Bigr) \prod_i \prod_j \bigl(1 - \Pr(t \vee s \mid B_{ij}^-)\bigr)$$

where $\Pr(t \vee s \mid B_{ij})$ is estimated as $\max\{\Pr(t \mid B_{ij}), \Pr(s \mid B_{ij})\}$. Other approximations (such as
noisy-or) are also possible.
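
A 2-disjunct concept can be evaluated with the max approximation as sketched below. The names and the toy data are illustrative assumptions; the positive bags are deliberately split between two clusters so that no single point explains both.

```python
import math


def prob(instance, t):
    """Pr(t | B_ij) = exp(-||B_ij - t||^2)."""
    return math.exp(-sum((b - x) ** 2 for b, x in zip(instance, t)))


def dd_two_disjunct(t, s, positive_bags, negative_bags):
    """Diverse Density of the concept 't or s', using max{Pr(t|B_ij), Pr(s|B_ij)}."""
    dd = 1.0
    for bag in positive_bags:
        miss = 1.0
        for instance in bag:
            miss *= 1.0 - max(prob(instance, t), prob(instance, s))
        dd *= 1.0 - miss
    for bag in negative_bags:
        for instance in bag:
            dd *= 1.0 - max(prob(instance, t), prob(instance, s))
    return dd


positive_bags = [[[0.0, 0.0]], [[5.0, 5.0]]]   # two separate clusters
negative_bags = [[[10.0, 10.0]]]
dd_pair = dd_two_disjunct([0.0, 0.0], [5.0, 5.0], positive_bags, negative_bags)
# Using the same point twice reduces the model to a single-point concept.
dd_single = dd_two_disjunct([0.0, 0.0], [0.0, 0.0], positive_bags, negative_bags)
```

The pair of points covers both positive bags, whereas the single point leaves the second bag unexplained and its Diverse Density drops to essentially zero.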

Finding the maximum Diverse Density in a high-dimensional space is a
difficult problem. In general, we are searching an arbitrary landscape and
the number of local maxima and size of the search space could prohibit any
efficient exploration. In this paper, we use gradient ascent (since DD is a
differentiable function) with multiple starting points. This has worked
successfully because we know what starting points to use. The maximum DD
point is made of contributions from some set of positive points. If we start
an ascent from every positive point, one of them is likely to be closest to
the maximum, contribute the most to it and have a climb directly to it.
Therefore, if we start an ascent from every positive instance, we are very
likely to find the maximum DD point. When we need to find both the location
and the scaling of the concept, we perform gradient ascent for both sets of
parameters at the same time (starting with all scale weightings at 1). The
number of dimensions in our search space has doubled, though. When we need
to find a 2-disjunct concept, we can again perform gradient ascent for all
parameters at once. This carries a high computational burden because the
number of dimensions has doubled, and we perform a gradient ascent starting
at every pair of positive instances.
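The multiple-start search can be sketched as follows for the simplest case (location only, all weights fixed at 1). This is an illustrative stand-in, not the authors' implementation: it assumes the instance model $\Pr(t \mid B_{ij}) = \exp(-\|B_{ij} - t\|^2)$, climbs the logarithm of DD, uses a finite-difference gradient with a clipped step, and all function names, step sizes, and toy bags are hypothetical.

```python
import numpy as np

def log_dd(t, pos_bags, neg_bags, eps=1e-12):
    """Log Diverse Density of a single point t (noisy-or over instances, assumed model)."""
    total = 0.0
    for bag in pos_bags:
        p = np.exp(-np.sum((bag - t) ** 2, axis=1))
        total += np.log(max(1.0 - np.prod(1.0 - p), eps))
    for bag in neg_bags:
        p = np.exp(-np.sum((bag - t) ** 2, axis=1))
        total += np.log(max(np.prod(1.0 - p), eps))
    return total

def numeric_grad(f, x, h=1e-5):
    """Central finite-difference gradient (used here instead of an analytic one)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = h
        g[k] = (f(x + step) - f(x - step)) / (2 * h)
    return g

def ascend(f, start, lr=0.05, iters=200, max_step=0.5):
    x = np.array(start, dtype=float)
    for _ in range(iters):
        step = lr * numeric_grad(f, x)
        norm = np.linalg.norm(step)
        if norm > max_step:              # keep steps bounded near negative instances
            step *= max_step / norm
        x += step
    return x

def max_dd(pos_bags, neg_bags):
    """Run one ascent from every positive instance and keep the best end point."""
    pos_bags = [np.asarray(b, dtype=float) for b in pos_bags]
    neg_bags = [np.asarray(b, dtype=float) for b in neg_bags]
    f = lambda t: log_dd(t, pos_bags, neg_bags)
    finals = [ascend(f, inst) for bag in pos_bags for inst in bag]
    return max(finals, key=f)

# Toy bags: every positive bag has one instance near (1, 1) plus one distractor.
pos = [[[1.0, 1.0], [4.0, 0.0]], [[1.1, 0.9], [0.0, 4.0]], [[0.9, 1.1], [4.0, 4.0]]]
neg = [[[4.0, 0.1], [0.1, 4.0], [4.0, 4.1]]]
print(max_dd(pos, neg))
```

Because at least one start lies next to the shared region of the positive bags, one of the ascents begins almost on top of the maximum, which is the intuition given in the paragraph above.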

Our goal in the next section is to show that: (1) Multiple-Instance learning
by maximizing diverse density can be used in the domain of natural scene
classification, (2) simple concepts in low resolution images are sufficient
to learn some of these concepts, and (3) adding false positives and false
negatives over multiple iterations (user interaction) can be used to improve
the classifier performance.

4  EXPERIMENTS 

In  this  section,  we  show  four  different  types  of  results 
from  running  the  system:  one  is  that  Multiple-Instance 
learning  is  applicable  to  this  domain.  A  second  result 
is that one does not need very complicated hypothesis
classes to learn concepts from the natural image
domain.  We  also  compare  the  performance  of  various 
hypotheses,  including  the  global  histogram  method. 
Finally,  we  show  how  user  interaction  would  work  to 
improve  the  classifier. 

4.1  EXPERIMENTAL  SETUP 

We tried to learn three different concepts: waterfall, mountain, and field.
For training and testing we used natural images from the COREL library, and
the labels given by COREL. These included 100 images from each of the
following classes: waterfalls, fields, mountains, sunsets and lakes. We also
used a larger test set of 2600 natural images from various classes.

We created a potential training set that consisted of 20 randomly chosen
images from each of the five classes mentioned above. This left us with a
small test set consisting of the remaining 80 images from each of the five
classes. We separated the potential training set from the testing set to
ensure that results of using various training schemes and hypothesis classes
can be compared fairly. Finally, the large test set contained 2600 natural
images from a large variety of classes.

For a given concept, we create an initial training set by picking five
positive examples of the concept and five negative examples, all from the
potential training set. After the concept is learned from these examples (by
finding the point in and scaling of feature space with maximum DD), the
unused 90 images in the potential training set are sorted by distance from
the learned concept*. This sorted list can be used to simulate what a user
would select as further refining examples. Specifically, the most egregious
false positives (the non-concept images at the beginning of the sorted list)
and the most egregious false negatives (the concept images at the end of the
sorted list) would likely be picked by the user as additional negative and
positive examples.
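The simulated user selection described above can be sketched as follows. Images are ranked by the bag distance (minimum scaled distance of any instance to the learned concept); the closest non-concept images are taken as false positives and the farthest concept images as false negatives. The function names and the toy bags are hypothetical and serve only to illustrate the procedure.

```python
import numpy as np

def bag_distance(bag, t, w):
    """Minimum scaled squared distance of any instance in the bag to concept t."""
    bag = np.asarray(bag, dtype=float)
    return float(np.min(np.sum(np.asarray(w) * (bag - np.asarray(t)) ** 2, axis=1)))

def pick_refinements(bags, labels, t, w, n_fp, n_fn):
    """Return indices of the most egregious false positives and false negatives."""
    order = sorted(range(len(bags)), key=lambda i: bag_distance(bags[i], t, w))
    false_pos = [i for i in order if not labels[i]][:n_fp]           # head of the list
    false_neg = [i for i in reversed(order) if labels[i]][:n_fn]     # tail of the list
    return false_pos, false_neg

# Toy pool of five bags; labels[i] is True when image i really shows the concept.
bags = [[[0.1, 0.0], [5.0, 5.0]], [[0.0, 0.05]], [[3.0, 3.0]], [[4.0, 4.0]], [[6.0, 6.0]]]
labels = [False, True, True, False, True]
print(pick_refinements(bags, labels, t=[0.0, 0.0], w=[1.0, 1.0], n_fp=1, n_fn=2))
```

Under this reading, the +3fp+2fn scheme would correspond to a call with `n_fp=3` and `n_fn=2`, with the returned images added to the training set as new negative and positive examples respectively.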

We attempted four different training schemes: initial is simply using the
initial five positive and five negative examples. +5fp adds the five most
egregious false positives. +10fp repeats the +5fp scheme twice. +3fp+2fn
adds 3 false positives and 2 false negatives.

All images were smoothed using a Gaussian filter and subsampled to 8 x 8. We
used the RGB color space in these experiments. For every class and for every
training scheme, we tried to learn the concept using one of seven hypothesis
classes (Figure 1 shows some examples):

1.  row:  an  instance  is  the  row’s  mean  color  and  the 
color  difference  in  the  rows  above  and  below  it. 

2.  single  blob  with  neighbors:  an  instance  is  the 
mean  color  of  a  2  x  2  blob  and  the  color  difference  with 
its  4  neighboring  blobs. 

3.  single  blob  with  no  neighbors:  an  instance  is 
the  color  of  each  of  the  pixels  in  a  2  x  2  blob. 

4.  disjunctive blob with neighbors: an instance is the same as the single
blob with neighbors but the concept learned is a disjunction of two single
blob concepts.

5.  disjunctive blob with no neighbors: an instance is the same as the
single blob with no neighbors but the concept learned is a disjunction of
two single blob concepts.

6.  two blob with neighbors: an instance is the mean color of two
descriptions of two single blob with neighbors and their relative spatial
relationship (whether the second blob is above or below, and whether it is
to the left or right, of the first blob).

7.  two blob with no neighbors: an instance is the mean color of two
descriptions of two single blob with no neighbors and their relative spatial
relationship.
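To make the instance definitions concrete, here is one possible reading of the single blob with neighbors hypothesis as a bag generator for an 8 x 8 RGB image. The neighbor geometry (2 x 2 blobs directly above, below, left, and right), the feature ordering, the stride of one pixel, and the function names are all assumptions made for this sketch; the text above does not specify them.

```python
import numpy as np

def blob_mean(img, r, c, size=2):
    """Mean color of the size x size blob whose top-left pixel is (r, c)."""
    return img[r:r + size, c:c + size].reshape(-1, img.shape[2]).mean(axis=0)

def blob_with_neighbors_bag(img, size=2):
    """One instance per blob: its mean color plus color differences to 4 neighbor blobs."""
    h, w, _ = img.shape
    offsets = [(-size, 0), (size, 0), (0, -size), (0, size)]  # above, below, left, right
    instances = []
    # Only blobs whose four neighbors lie fully inside the image are used.
    for r in range(size, h - 2 * size + 1):
        for c in range(size, w - 2 * size + 1):
            center = blob_mean(img, r, c, size)
            diffs = [center - blob_mean(img, r + dr, c + dc, size) for dr, dc in offsets]
            instances.append(np.concatenate([center] + diffs))
    return np.array(instances)

img = np.random.default_rng(0).random((8, 8, 3))  # stand-in for a smoothed 8 x 8 image
bag = blob_with_neighbors_bag(img)
print(bag.shape)  # (number of instances, 15 features per instance)
```

With these choices each instance has 15 features (3 for the blob's mean color and 3 for each of the 4 differences), and each image yields a bag of 9 instances.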

Learning a concept took anywhere from a few seconds for the simple
hypotheses to a few days for the 2-blob and disjunctive hypotheses. The more
complicated hypotheses take longer to learn because of the higher number of
features and because the number of instances per bag is large (and to find
the maximum DD point, we perform a gradient ascent from every positive
instance). Because this is a prototype, we have not tried to optimize the
running time; however, a more intelligent method of generating instances
(for example, a rough segmentation using connected components) will reduce
both the number of instances and the running time by orders of magnitude.

*An image/bag's distance from the concept is the minimum distance of any of
the image's subregions/instances from the concept.

Figure 2: Comparison of learned concept (solid curves) with hand-crafted
templates (dashed curves) for the mountain concept on 240 images from the
small test set. The top and bottom dashed precision-recall curves indicate
the best-case and worst-case curves for the first 32 images retrieved by the
hand-crafted template which all have the same score.

4.2  RESULTS 

In this section we show results of testing the various hypothesis classes,
training schemes, and concept classes against the small test set and the
larger one. The small test set does not intersect the potential training
set, and therefore more accurately represents the generalization of the
learned concepts. The large test set is meant to show how the system scales
to larger image databases.

The graphs shown are precision-recall and recall curves. Precision is the
ratio of the number of correct images retrieved so far to the number of
images seen so far. Recall is the ratio of the number of correct images
retrieved so far to the total number of correct images in the test set. For
example, in Figure 3, the waterfall precision-recall curve has recall 0.5
with precision of about 0.7, which means that in order to retrieve 40 of the
80 waterfalls, 30% of the images retrieved are not waterfalls. We show both
curves because (1) the beginning of the precision-recall curve is of
interest to applications where only the top few objects are of importance,
and (2) the middle of the recall curve is of interest to applications where
correct classification of a large percentage of the database is important.
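The two measures can be computed from a ranked retrieval list as in the short sketch below; the function name and the toy ranking are hypothetical. As a check on the worked example in the text, retrieving 40 waterfalls at a precision of 0.7 corresponds to about 40 / 0.7, or roughly 57, images seen.

```python
def precision_recall(ranked_labels):
    """Return (precision, recall) after each retrieved image.

    ranked_labels[i] is True when the i-th retrieved image is a correct one.
    """
    total_correct = sum(ranked_labels)
    if total_correct == 0:
        raise ValueError("the ranking contains no correct images")
    points, hits = [], 0
    for seen, is_correct in enumerate(ranked_labels, start=1):
        hits += bool(is_correct)
        points.append((hits / seen, hits / total_correct))
    return points

# Toy ranking of five images, three of which are correct.
for precision, recall in precision_recall([True, True, False, True, False]):
    print(f"precision={precision:.2f} recall={recall:.2f}")
```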

Figure  2  shows  that  the  performance  of  the  learned 
mountain  concept  is  competitive  with  a  hand-crafted 
mountain template (from [Lipson et al., 1997]^). The
test  set  consists  of  80  mountains,  80  fields,  and  80 
waterfalls.  It  is  disjoint  from  the  training  set.  The 
hand-crafted  model’s  precision-recall  curve  is  flat  at 
84%  because  the  first  32  images  all  receive  the  same 
score,  and  27  of  them  are  mountains.  We  also  show 
the  curves  if  we  were  to  retrieve  the  27  mountains  first 
(best-case)  or  after  the  first  five  false  positives  (worst- 
case). 

In Figure 3, we show the performance of the best hypothesis and training
method on each concept class. The dashed lines show the poor performance of
the global histogram method. The solid lines in the precision-recall graph
show the performance of single blob with neighbors with +10fp for
waterfalls, row with +10fp for fields, and disjunctive blob with no
neighbors with +10fp for mountains. The solid lines in the recall curve show
the performance of the single blob with neighbors with +10fp for waterfalls,
single blob with neighbors with +3fp+2fn for fields, and row with +3fp+2fn
for mountains. This behavior continues for the larger test set.

In  Figure  4,  we  show  the  precision-recall  curves  for 
each  of  the  four  training  schemes.  We  average  over 
all  concepts  and  all  hypothesis  classes.  We  see  that 
performance improves with user interaction. This behavior
continues for the larger test set as well.

In Figure 5, we show the precision-recall and recall curves for each of the
seven hypotheses averaged over all concepts and all training schemes. Note
that these curves are for the larger 2600 image database. We see that the
single blob with neighbors hypothesis has good precision. We also see that
the more complicated hypothesis classes (i.e. the disjunctive concepts and
the two-blob concepts) tend to have better recall curves.

^Lipson's classifier was modified to give a ranking of each image, rather
than its class.

Figure 3: The best curves for each concept using a small test set. Dashed
curves are the global histogram's performance.

In Figure 6, we show a snapshot of the system in action. The system is
trained using training scheme +10fp for the waterfall concept. It has
learned a waterfall concept using the single blob with neighbors hypothesis.
The learned waterfall concept is that somewhere in the image there is a blob
whose left neighbor is less blue, whose own blue value is 0.5 (where RGB
values are in the [0, 1] cube), whose neighbor below has the same blue
value, whose neighbor above has the same red value, whose green value is
0.55, whose neighbor above has the same blue value and whose red value is
0.47. These properties are weighted in the order given, and any other
features were found to be irrelevant. A new image has the rating of the
minimum distance of one of its instances to the learned concept, where the
distance metric uses the learned scaling to account for the importance of
the relevant features. As we can see in the figure, this simple learned
concept is able to retrieve a wide variety of waterfall scenes.

Figure 4: Different training schemes, averaged over concept and hypothesis
class, using a small test set.

The top 20 images in the figure are the training set. The first 10 images
are the initial positive and negative examples used in training. The next 10
images are the false positives added. The last 30 images are the top 30
returned from the large dataset.

5  CONCLUSIONS 

In this paper, we have shown that Multiple-Instance learning by maximizing
diverse density can be used to classify images of natural scenes. Our
results are competitive with hand-crafted models, and much better than a
global histogram approach. We have also demonstrated that simple learned
concepts that capture color relations in low resolution images can be used
effectively in the domain of natural scene classification. Our experiments
indicate that complicated concepts (e.g. disjunctive concepts) tend to have
better recall curves and that user interaction (adding false positives and
false negatives) over multiple iterations can improve the performance of the
classifier. Our architecture, by separating the bag generator from the
learning mechanism, allows progress in the field of computer vision to
benefit the field of machine learning and vice versa.

Figure 5: Different hypothesis classes averaged over concept and training
scheme, using a large test set with 2600 images.

Figure 6: Results for the waterfall concept using the single blob with
neighbors concept with +10fp. Top row: Initial training set, 5 positive and
5 negative examples. Second row: Additional false positives. Last three
rows: Top 30 matches retrieved from the large test set. The red squares
indicate where the closest instance to the learned concept is located.

Abstract

This paper shows how a text classifier's need for labeled training documents
can be reduced by taking advantage of a large pool of unlabeled documents.
We modify the Query-by-Committee (QBC) method of active learning to use the
unlabeled pool for explicitly estimating document density when selecting
examples for labeling. Then active learning is combined with
Expectation-Maximization in order to "fill in" the class labels of those
documents that remain unlabeled. Experimental results show that the
improvements to active learning require less than two-thirds as many labeled
training examples as previous QBC approaches, and that the combination of EM
and active learning requires only slightly more than half as many labeled
training examples to achieve the same accuracy as either the improved active
learning or EM alone.


1  Introduction 

Obtaining labeled training examples for text classification is often
expensive, while gathering large quantities of unlabeled examples is usually
very cheap. For example, consider the task of learning which web pages a
user finds interesting. The user may not have the patience to hand-label a
thousand training pages as interesting or not, yet multitudes of unlabeled
pages are readily available on the Internet.

This paper presents techniques for using a large pool of unlabeled documents
to improve text classification when labeled training data is sparse. We
enhance the QBC active learning algorithm to select labeling requests from
the entire pool of unlabeled documents, and explicitly use the pool to
estimate regional document density. We also combine active learning with
Expectation-Maximization (EM) in order to take advantage of the word
co-occurrence information contained in the many documents that remain in the
unlabeled pool.

In previous work [Nigam et al. 1998] we show that combining the evidence of
labeled and unlabeled documents via EM can reduce text classification error
by one-third. We treat the absent labels as "hidden variables" and use EM to
fill them in. EM improves the classifier by alternately using the current
classifier to guess the hidden variables, and then using the current guesses
to advance classifier training. EM consequently finds the classifier
parameters that locally maximize the probability of both the labeled and
unlabeled data.

Active learning approaches this same problem in a different way. Unlike our
EM setting, the active learner can request the true class label for certain
unlabeled documents it selects. However, each request is considered an
expensive operation and the point is to perform well with as few queries as
possible. Active learning aims to select the most informative examples, in
many settings defined as those that, if their class label were known, would
maximally reduce classification error and variance over the distribution of
examples [Cohn, Ghahramani, & Jordan 1996]. When calculating this in
closed-form is prohibitively complex, the Query-by-Committee (QBC) algorithm
[Freund et al. 1997] can be used to select documents that have high
classification variance themselves. QBC measures the variance indirectly, by
examining the disagreement among class labels assigned by a set of
classifier variants, sampled from the probability distribution of
classifiers that results from the labeled training examples.

This paper shows that a pool of unlabeled examples can be used to benefit
both active learning and EM. Rather than having active learning choose
queries by synthetically generating them (which is awkward with text), or by
selecting examples from a stream (which inefficiently models the data
distribution), we advocate selecting the best examples from the entire pool
of unlabeled documents (and using the pool to explicitly model density); we
call this last scheme pool-based sampling. In experimental results on a
real-world text data set, this technique is shown to reduce the need for
labeled documents by 42% over previous QBC approaches. Furthermore, we show
that the combination of QBC and EM learns with fewer labeled examples than
either individually, requiring only 58% as many labeled examples as EM
alone, and only 26% as many as QBC alone. We also discuss our initial
approach to a richer combination we call pool-leveraged sampling that
interleaves active learning and EM such that EM's modeling of the unlabeled
data informs the selection of active learning queries.


2  Probabilistic Framework for Text Classification

This section presents a Bayesian probabilistic framework for text
classification. The next two sections add EM and active learning by building
on this framework. We approach the task of text classification from a
Bayesian learning perspective: we assume that the documents are generated by
a particular parametric model, and use training data to calculate
Bayes-optimal estimates of the model parameters. Then, we use these
estimates to classify new test documents by turning the generative model
around with Bayes' rule, calculating the probability that each class would
have generated the test document in question, and selecting the most
probable class.

Our parametric model is naive Bayes, which is based on commonly used
assumptions [Friedman 1997; Joachims 1997]. First we assume that text
documents are generated by a mixture model (parameterized by $\theta$), and
that there is a one-to-one correspondence between the (observed) class
labels and the mixture components. We use the notation
$c_j \in C = \{c_1, \ldots, c_{|C|}\}$ to indicate both the $j$th component
and $j$th class. Each component $c_j$ is parameterized by a disjoint subset
of $\theta$. These assumptions specify that a document is created by (1)
selecting a class according to the prior probabilities,
$P(c_j \mid \theta)$, then (2) having that class component generate a
document according to its own parameters, with distribution
$P(d_i \mid c_j; \theta)$. We can characterize the likelihood of a document
as a sum of total probability over all generative components:

$$P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta). \qquad (1)$$

Document $d_i$ is considered to be an ordered list of word events. We write
$w_{d_{ik}}$ for the word in position $k$ of document $d_i$, where the
subscript of $w$ indicates an index into the vocabulary
$V = \{w_1, w_2, \ldots, w_{|V|}\}$. We make the standard naive Bayes
assumption: that the words of a document are generated independently of
context, that is, independently of the other words in the same document
given the class. We further assume that the probability of a word is
independent of its position within the document. Thus, we can express the
class-conditional probability of a document by taking the product of the
probabilities of the independent word events:

$$P(d_i \mid c_j; \theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_{ik}} \mid c_j; \theta), \qquad (2)$$

where we assume the length of the document, $|d_i|$, is distributed
independently of class. Each individual class component is parameterized by
the collection of word probabilities, such that
$\theta_{w_t|c_j} = P(w_t \mid c_j; \theta)$, where
$t \in \{1, \ldots, |V|\}$ and $\sum_t P(w_t \mid c_j; \theta) = 1$. The
other parameters of the model are the class prior probabilities
$\theta_{c_j} = P(c_j \mid \theta)$, which indicate the probabilities of
selecting each mixture component.

Given these underlying assumptions of how the data are produced, the task of
learning a text classifier consists of forming an estimate of $\theta$,
written $\hat\theta$, based on a set of training data. With labeled training
documents, $\mathcal{D} = \{d_1, \ldots, d_{|\mathcal{D}|}\}$, we can
calculate estimates for the parameters of the model that generated these
documents. To calculate the probability of a word given a class,
$\theta_{w_t|c_j}$, simply count the fraction of times the word occurs in
the data for that class, augmented with a Laplacean prior. This smoothing
prevents probabilities of zero for infrequently occurring words. These word
probability estimates are:

$$\hat\theta_{w_t|c_j} = \frac{1 + \sum_{i=1}^{|\mathcal{D}|} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|\mathcal{D}|} N(w_s, d_i)\, P(c_j \mid d_i)}, \qquad (3)$$

where $N(w_t, d_i)$ is the count of the number of times word $w_t$ occurs in
document $d_i$, and where $P(c_j \mid d_i) \in \{0, 1\}$, given by the class
label. The class prior probabilities, $\theta_{c_j}$, are estimated in the
same fashion of counting, but without smoothing:

$$\hat\theta_{c_j} = \frac{\sum_{i=1}^{|\mathcal{D}|} P(c_j \mid d_i)}{|\mathcal{D}|}. \qquad (4)$$

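Given estimates of these parameters calculated from the training documents,
it is possible to turn the generative model around and calculate the
probability that a particular class component generated a given document. We
formulate this by an application of Bayes' rule, and then substitutions
using Equations 1 and 2:

$$P(c_j \mid d_i; \hat\theta) = \frac{P(c_j \mid \hat\theta)\, P(d_i \mid c_j; \hat\theta)}{P(d_i \mid \hat\theta)} = \frac{P(c_j \mid \hat\theta) \prod_{k=1}^{|d_i|} P(w_{d_{ik}} \mid c_j; \hat\theta)}{\sum_{r=1}^{|C|} P(c_r \mid \hat\theta) \prod_{k=1}^{|d_i|} P(w_{d_{ik}} \mid c_r; \hat\theta)}. \qquad (5)$$

If the task is to classify a test document $d_i$ into a single class, simply
select the class with the highest posterior probability:
$\arg\max_j P(c_j \mid d_i; \hat\theta)$.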
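Equations 3 to 5 can be sketched with a few array operations. In the sketch below, documents are assumed to be given as a word-count matrix and labels as a one-hot matrix of $P(c_j \mid d_i)$; the function names, the array layout, and the toy data are assumptions of this example. The posterior is computed in log space, and the length term $P(|d_i|)$ is omitted because it cancels in Equation 5.

```python
import numpy as np

def train_nb(counts, resp):
    """counts: |D| x |V| matrix of N(w_t, d_i); resp: |D| x |C| matrix of P(c_j | d_i)."""
    word_class = resp.T @ counts                       # |C| x |V| weighted word counts
    theta_w = (1.0 + word_class) / (
        counts.shape[1] + word_class.sum(axis=1, keepdims=True))   # Equation 3
    theta_c = resp.sum(axis=0) / counts.shape[0]                    # Equation 4
    return theta_w, theta_c

def posterior(counts, theta_w, theta_c):
    """Equation 5 for every document; assumes every class prior is non-zero."""
    log_joint = np.log(theta_c) + counts @ np.log(theta_w).T        # |D| x |C|
    log_joint -= log_joint.max(axis=1, keepdims=True)               # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

# Toy corpus: 4 labeled documents over a 3-word vocabulary, 2 classes.
counts = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 1], [0, 2, 2]], dtype=float)
resp = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
theta_w, theta_c = train_nb(counts, resp)
test_doc = np.array([[4, 0, 0]], dtype=float)
print(posterior(test_doc, theta_w, theta_c).argmax(axis=1))  # index of most probable class
```

Because `resp` may hold fractional values as well as 0/1 labels, the same `train_nb` routine can be reused unchanged when labels are only probabilistically known.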

Note that our assumptions about the generation of text documents are all
violated in practice, and yet empirically, naive Bayes does a good job of
classifying text documents [Lewis & Ringuette 1994; Craven et al. 1998;
Joachims 1997]. This paradox is explained by the fact that classification
estimation is only a function of the sign (in binary cases) of the function
estimation [Friedman 1997; Domingos & Pazzani 1997]. Also note that our
formulation of naive Bayes assumes a multinomial event model for documents;
this generally produces better text classification accuracy than another
formulation that assumes a multi-variate Bernoulli [McCallum & Nigam 1998].


3  EM  and  Unlabeled  Data 

When naive Bayes is given just a small set of labeled training data,
classification accuracy will suffer because variance in the parameter
estimates of the generative model will be high. However, by augmenting this
small set with a large set of unlabeled data and combining the two pools
with EM, we can improve the parameter estimates. This section describes how
to use EM to combine these pools within the probabilistic framework of the
previous section.

EM is a class of iterative algorithms for maximum likelihood estimation in
problems with incomplete data [Dempster, Laird, & Rubin 1977]. Given a model
of data generation, and data with some missing values, EM alternately uses
the current model to estimate the missing values, and then uses the missing
value estimates to improve the model. Using all the available data, EM will
locally maximize the likelihood of the generative parameters, giving
estimates for the missing values.

In our text classification setting, we treat the class labels of the
unlabeled documents as missing values, and then apply EM. The resulting
naive Bayes parameter estimates often give significantly improved
classification accuracy on the test set when the pool of labeled examples is
small [Nigam et al. 1998].^ This use of EM is a special case of a more
general missing values formulation [Ghahramani & Jordan 1994].

In implementation, EM is an iterative two-step process. The E-step
calculates probabilistically-weighted class labels,
$P(c_j \mid d_i; \hat\theta)$, for every unlabeled document using a current
estimate of $\theta$ and Equation 5. The M-step calculates a new maximum
likelihood estimate for $\theta$ using all the labeled data, both original
and probabilistically labeled, by Equations 3 and 4. We initialize the
process with parameter estimates using just the labeled training data, and
iterate until $\hat\theta$ reaches a fixed point. See [Nigam et al. 1998]
for more details.
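The two-step loop can be sketched as below. The sketch restates the estimators of Equations 3 to 5 so that it runs on its own; the count-matrix layout, the function names, and the stopping rule (a small tolerance on the change in word probabilities standing in for "reaches a fixed point") are assumptions of this example, not details given in the text.

```python
import numpy as np

def m_step(counts, resp):
    """Equations 3 and 4 from (possibly fractional) class memberships."""
    wc = resp.T @ counts
    theta_w = (1.0 + wc) / (counts.shape[1] + wc.sum(axis=1, keepdims=True))
    theta_c = resp.sum(axis=0) / counts.shape[0]
    return theta_w, theta_c

def e_step(counts, theta_w, theta_c):
    """Equation 5: probabilistically-weighted class labels for each document."""
    lj = np.log(theta_c) + counts @ np.log(theta_w).T
    p = np.exp(lj - lj.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

def em_naive_bayes(lab_counts, lab_resp, unl_counts, max_iters=50, tol=1e-6):
    theta_w, theta_c = m_step(lab_counts, lab_resp)    # initialise from labeled data only
    all_counts = np.vstack([lab_counts, unl_counts])
    for _ in range(max_iters):
        unl_resp = e_step(unl_counts, theta_w, theta_c)                       # E-step
        new_w, new_c = m_step(all_counts, np.vstack([lab_resp, unl_resp]))    # M-step
        change = np.abs(new_w - theta_w).max()
        theta_w, theta_c = new_w, new_c
        if change < tol:                               # treated as the fixed point
            break
    return theta_w, theta_c

# Toy data: 2 labeled and 4 unlabeled documents over a 2-word vocabulary.
lab_counts = np.array([[3.0, 0.0], [0.0, 3.0]])
lab_resp = np.eye(2)
unl_counts = np.array([[5.0, 1.0], [4.0, 0.0], [1.0, 5.0], [0.0, 4.0]])
theta_w, theta_c = em_naive_bayes(lab_counts, lab_resp, unl_counts)
print(theta_w)
print(theta_c)
```

The labeled documents keep their 0/1 memberships in every M-step; only the unlabeled rows are re-estimated in each E-step.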

4  Active  Learning  with  EM 

Rather than estimating class labels for unlabeled documents, as EM does,
active learning instead requests the true class labels for unlabeled
documents it selects. In many settings, an optimal active learner should
select those documents that, when labeled and incorporated into training,
will minimize classification error over the distribution of future
documents. Equivalently in probabilistic frameworks without bias, active
learning aims to minimize the expected classification variance over the
document distribution. Note that naive Bayes' independence assumption and
Laplacean priors do introduce bias. However, variance tends to dominate bias
in classification error [Friedman 1997], and thus we focus on reducing
variance.

The Query-by-Committee (QBC) method of active learning measures this
variance indirectly [Freund et al. 1997]. It samples several times from the
classifier parameter distribution that results from the training data, in
order to create a "committee" of classifier variants. This committee
approximates the entire classifier distribution. QBC then classifies
unlabeled documents with each committee member, and measures the
disagreement between their classifications, thus approximating the
classification variance. Finally, documents on which the committee disagrees
strongly are selected for labeling requests. The newly labeled documents are
included in the training data, and a new committee is sampled for making the
next set of requests. This section presents each of these steps in detail,
and then explains its integration with EM. Our implementation of this
algorithm is summarized in Table 1.

^ When the classes do not correspond to the natural clusters of the data, EM
can hurt accuracy instead of helping. Our previous work also describes a
method for avoiding these detrimental effects.

Our committee members are created by sampling classifiers according to the
distribution of classifier parameters specified by the training data. Since
the probability of the naive Bayes parameters for each class is described by
a Dirichlet distribution, we sample the parameters $\theta_{w_t|c_j}$ from
the posterior Dirichlet distribution based on training data word counts,
$N(\cdot, \cdot)$. This is performed by drawing weights, $v_{tj}$, for each
word $w_t$ and class $c_j$ from the Gamma distribution:
$v_{tj} = \mathrm{Gamma}(\alpha_t + N(w_t, c_j))$, where $\alpha_t$ is
always 1, as specified by our Laplacean prior. Then we set the parameters
$\theta_{w_t|c_j}$ to the normalized weights by
$\theta_{w_t|c_j} = v_{tj} / \sum_s v_{sj}$. We sample to create a
classifier $k$ times, resulting in $k$ committee members. Individual
committee members are denoted by $m$.
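The sampling step can be sketched with NumPy's Gamma sampler: drawing independent Gamma weights with unit scale and normalizing them within each class yields a draw from the corresponding Dirichlet distribution. The function name, the class-by-word array layout (the text indexes $v_{tj}$ word first), the fixed seed, and the toy counts are assumptions of this example.

```python
import numpy as np

def sample_committee(word_class_counts, k, alpha=1.0, seed=0):
    """Sample k sets of word probabilities, one per committee member.

    word_class_counts: |C| x |V| array holding N(w_t, c_j) for each class and word.
    Returns an array of shape (k, |C|, |V|) whose rows over words sum to 1.
    """
    rng = np.random.default_rng(seed)
    shape = alpha + np.asarray(word_class_counts, dtype=float)   # alpha_t + N(w_t, c_j)
    weights = rng.gamma(shape, 1.0, size=(k,) + shape.shape)     # Gamma draws, unit scale
    return weights / weights.sum(axis=2, keepdims=True)          # normalise per class

# Toy counts for 2 classes over a 4-word vocabulary.
counts = np.array([[3, 0, 1, 2], [0, 5, 1, 0]])
committee = sample_committee(counts, k=3)
print(committee.shape)
print(committee[0])
```

Words with larger counts receive more concentrated draws, so committee members differ most on words for which the training data provide little evidence.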

We consider two metrics for measuring committee disagreement. The previously
employed vote entropy [Dagan & Engelson 1995] is the entropy of the class
label distribution resulting from having each committee member "vote" with
probability mass $1/k$ for its winning class. One disadvantage of vote
entropy is that it does not consider the confidence of the committee
members' classifications, as indicated by the class probabilities from each
member.
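Vote entropy as described above can be sketched in a few lines; the function name and the example votes are hypothetical. Each member contributes mass $1/k$ to its winning class, and the entropy of the resulting distribution is returned (natural logarithm, an arbitrary choice of base for this example).

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the class distribution obtained from one vote per committee member."""
    k = len(votes)
    return -sum((n / k) * math.log(n / k) for n in Counter(votes).values())

print(vote_entropy(["sports", "sports", "sports"]))      # unanimous committee
print(vote_entropy(["sports", "politics", "sports"]))    # one dissenting member
print(vote_entropy(["sports", "politics"]))              # evenly split committee
```

Only the winning class of each member enters the computation, so a member that prefers a class with probability 0.51 counts exactly as much as one that prefers it with probability 0.99, which is the disadvantage noted in the text.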

To  capture  this  information,  we  propose  to  mea¬ 
sure  committee  disagreement  for  each  document  us¬ 
ing  Kullback-Leibler  divergence  to  the  mean  [Pereira, 
Tishby,  &  Lee  1993].  Unlike  vote  entropy,  which  com¬ 
pares  only  the  committee  members’  top  ranked  class, 
KL  divergence  measures  the  strength  of  the  certainty 
of disagreement by calculating differences in the committee
members’ class distributions, $P_m(C|d_i)$.¹ Each


¹While naive Bayes is not an accurate probability estimator
[Domingos & Pazzani 1997], naive Bayes classification
scores are somewhat correlated to confidence; the fact
that naive Bayes scores can be successfully used to make
accuracy/coverage trade-offs is testament to this.


•  Calculate the density for each document. (Eq. 9)

•  Loop while adding documents:

   -  Build an initial estimate of θ from the labeled documents only. (Eqs. 3 and 4)

   -  Loop k times, once for each committee member:

      +  Create a committee member by sampling for each class from the appropriate Dirichlet distribution.

      +  Starting with the sampled classifier, apply EM with the unlabeled data. Loop while parameters change:

         ■  Use the current classifier to probabilistically label the unlabeled documents. (Eq. 5)

         ■  Recalculate the classifier parameters given the probabilistically-weighted labels. (Eqs. 3 and 4)

      +  Use the current classifier to probabilistically label all unlabeled documents. (Eq. 5)

   -  Calculate the disagreement for each unlabeled document (Eq. 7), multiply by its density, and request the class label for the one with the highest score.

•  Build a classifier with the labeled data. (Eqs. 3 and 4)

•  Starting with this classifier, apply EM as above.

Table  1:  Our  active  learning  algorithm.  Traditional  Query- 
by-Committee  omits  the  EM  steps,  indicated  by  italics, 
does  not  use  the  density,  and  works  in  a  stream-based  set¬ 
ting. 

committee member $m$ produces a posterior class distribution,
$P_m(C|d_i)$, where $C$ is a random variable over classes. KL
divergence to the mean is an average of the KL divergence
between each distribution and the mean of all the
distributions:

$$\frac{1}{k} \sum_{m=1}^{k} D\big(P_m(C|d_i) \,\|\, P_{avg}(C|d_i)\big), \qquad (6)$$

where $P_{avg}(C|d_i)$ is the class distribution mean over all
committee members, $m$: $P_{avg}(C|d_i) = \big(\sum_m P_m(C|d_i)\big)/k$.

KL divergence, $D(\cdot\|\cdot)$, is an information-theoretic
measure of the difference between two distributions,
capturing the number of extra “bits of information”
required to send messages sampled from the first
distribution using a code that is optimal for the second.
The KL divergence between distributions $P_1(C)$ and
$P_2(C)$ is:

$$D(P_1(C) \,\|\, P_2(C)) = \sum_{j=1}^{|C|} P_1(c_j) \log\left(\frac{P_1(c_j)}{P_2(c_j)}\right). \qquad (7)$$
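
The two definitions above translate directly into code. The following is a minimal sketch; the function names are hypothetical, and the choice of base-2 logarithms (to match the "bits" interpretation) is an assumption, since the text does not fix the base.

```python
import math

def kl_divergence(p1, p2):
    """D(P1 || P2) in bits for two distributions over the same classes."""
    # Terms with P1(c) = 0 contribute nothing to the sum.
    return sum(p * math.log2(p / p2[c]) for c, p in p1.items() if p > 0)

def kl_to_mean(member_posteriors):
    """Average KL divergence from each member's class distribution to the mean."""
    k = len(member_posteriors)
    classes = member_posteriors[0].keys()
    mean = {c: sum(p[c] for p in member_posteriors) / k for c in classes}
    # The mean is nonzero wherever any member is nonzero, so the ratio is safe.
    return sum(kl_divergence(p, mean) for p in member_posteriors) / k

# Hypothetical committees: a confident disagreement and a marginal one.
strong = kl_to_mean([{"a": 0.99, "b": 0.01}, {"a": 0.01, "b": 0.99}])
weak = kl_to_mean([{"a": 0.51, "b": 0.49}, {"a": 0.49, "b": 0.51}])
```

Unlike a count of top-class votes, this score separates the two cases: the confident disagreement scores far higher than the marginal one.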
After disagreement has been calculated, a document
is  selected  for  a  class  label  request.  (Selecting  more 
than  one  document  at  a  time  can  be  a  computational 
convenience.)  We  consider  three  ways  of  selecting 
documents:  stream-based,  pool-based,  and  density- 
weighted  pool-based.  Some  previous  applications  of 
QBC  [Dagan  &  Engelson  1995;  Liere  &  Tadepalli  1997] 
use  a  simulated  stream  of  unlabeled  documents.  When 
a  document  is  produced  by  the  stream,  this  approach 
measures  the  classification  disagreement  among  the 
committee  members,  and  decides,  based  on  the  dis¬ 
agreement,  whether  to  select  that  document  for  la¬ 
beling.  Dagan  and  Engelson  do  this  heuristically  by 
dividing  the  vote  entropy  by  the  maximum  entropy  to 
create  a  probability  of  selecting  the  document.  Dis¬ 
advantages  of  using  stream-based  sampling  are  that  it 
only  sparsely  samples  the  full  distribution  of  possible 
document  labeling  requests,  and  that  the  decision  to 
label  is  made  on  each  document  individually,  irrespec¬ 
tive  of  the  alternatives. 
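
The stream-based heuristic can be sketched as a single randomized decision per document. This is an illustration only: the function name is hypothetical, and taking the maximum entropy to be log2 of min(committee size, number of classes) is an assumption about the normalizer, which the text does not spell out.

```python
import math
import random

def stream_select(vote_entropy_bits, committee_size, num_classes, rng=random):
    """Decide whether to request a label for one document from the stream.

    The selection probability is the vote entropy divided by its maximum
    possible value (assumed here to be log2(min(k, |C|))).
    """
    max_entropy = math.log2(min(committee_size, num_classes))
    if max_entropy == 0.0:
        return False  # a single voter or a single class can never disagree
    return rng.random() < vote_entropy_bits / max_entropy
```

Because each call looks at one document in isolation, a highly informative document arriving later in the stream cannot displace a mediocre one that was already selected, which is the second disadvantage noted above.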

An  alternative  that  aims  to  address  these  problems 
is  pool-based  sampling.  It  selects  from  among  all 
the  unlabeled  documents  in  a  pool  the  one  with  the 
largest  disagreement.  However,  this  loses  one  bene¬ 
fit  of  stream-based  sampling — the  implicit  modeling 
of  the  data  distribution — and  it  may  select  documents 
that  have  high  disagreement,  but  are  in  unimportant, 
sparsely  populated  regions. 

We  can  retain  this  distributional  information  by  se¬ 
lecting  documents  using  both  the  classification  dis¬ 
agreement  and  the  “density”  of  the  region  around 
a document. This density-weighted pool-based sampling
method prefers documents with high classification
variance that are also similar to many other documents.
The stream approach approximates this implicitly; we
accomplish this more accurately (especially when labeling
a small number of documents) by modeling the density
explicitly.

We  approximate  the  density  in  a  region  around  a  par¬ 
ticular  document  by  measuring  the  average  distance 
from  that  document  to  all  other  documents.  Distance, 
$Y$, between individual documents is measured by using
exponentiated KL divergence:

$$Y(d_i, d_h) = e^{-\beta\, D\left(P(W|d_h) \,\|\, (\lambda P(W|d_i) + (1-\lambda) P(W))\right)} \qquad (8)$$

where $W$ is a random variable over words in the
vocabulary; $P(W|d_i)$ is the maximum likelihood estimate
of words sampled from document $d_i$ (i.e.,
$P(w_t|d_i) = N(w_t, d_i)/|d_i|$); $P(W)$ is the marginal
distribution over words; $\lambda$ is a parameter that determines
how much smoothing to use on the encoding distribution
(we must ensure no zeroes here to prevent infinite
distances); and $\beta$ is a parameter that determines the
sharpness of the distance metric.

In essence, the average KL divergence between a document,
$d_i$, and all other documents measures the degree of
overlap between $d_i$ and all other documents;
exponentiation converts this information-theoretic number
of “bits of information” into a scalar distance.

When calculating the average distance from $d_i$ to all
other documents it is much more computationally efficient
to calculate the geometric mean than the arithmetic
mean, because the distance to all documents that share
no words with $d_i$ can be calculated in advance, and we
only need make corrections for the words that appear in
$d_i$. Using a geometric mean, we define density, $Z$, of
document $d_i$ to be

$$Z(d_i) = e^{\frac{1}{|\mathcal{D}|} \sum_{d_h \in \mathcal{D}} \ln(Y(d_i, d_h))} \qquad (9)$$
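
The density computation can be sketched as follows. This is a naive quadratic-time illustration, not the precomputation trick described above, and several details are assumptions: the function names, natural logarithms inside the KL divergence, and averaging over every document in the pool (including the document itself).

```python
import math
from collections import Counter

def word_distribution(tokens):
    """Maximum likelihood P(W | d): relative word frequencies in one document."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def log_similarity(p_h, p_i, p_marginal, lam=0.5, beta=3.0):
    """ln Y(d_i, d_h) = -beta * D(P(W|d_h) || lam * P(W|d_i) + (1 - lam) * P(W))."""
    kl = 0.0
    for w, p in p_h.items():
        # Smoothing with the marginal keeps the encoding distribution nonzero.
        smoothed = lam * p_i.get(w, 0.0) + (1.0 - lam) * p_marginal[w]
        kl += p * math.log(p / smoothed)
    return -beta * kl

def densities(docs, lam=0.5, beta=3.0):
    """Z(d_i): geometric mean of Y(d_i, d_h) over every document d_h in the pool."""
    dists = [word_distribution(d) for d in docs]
    pooled = Counter(w for d in docs for w in d)
    total = sum(pooled.values())
    p_marginal = {w: c / total for w, c in pooled.items()}
    return [
        math.exp(sum(log_similarity(p_h, p_i, p_marginal, lam, beta)
                     for p_h in dists) / len(dists))
        for p_i in dists
    ]

# Hypothetical pool: five documents on one topic and a single outlier.
pool = [["disk", "drive"]] * 5 + [["zebra", "quark"]]
scores = densities(pool)
```

Working with the log of Y and exponentiating once at the end is exactly the geometric mean, and it avoids underflow when individual distances are very small. In the toy pool the outlier receives the lowest density.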

We  combine  this  density  metric  with  disagreement  by 
selecting  the  document  that  has  the  largest  product  of 
density  (Equation  9)  and  disagreement  (Equation  6). 
This  density-weighted  pool-based  sampling  selects  the 
document  that  is  representative  of  many  other  docu¬ 
ments,  and  about  which  there  is  confident  committee 
disagreement. 
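
The selection rule itself is a one-line maximization. The sketch below uses hypothetical document identifiers and made-up scores purely to show how density weighting changes the choice.

```python
def select_document(disagreement, density):
    """Return the unlabeled document with the largest density * disagreement."""
    return max(disagreement, key=lambda d: density[d] * disagreement[d])

# Hypothetical scores for three unlabeled documents.
disagreement = {"faq": 0.40, "outlier": 0.90, "typical": 0.10}
density = {"faq": 0.50, "outlier": 0.05, "typical": 0.60}

weighted_choice = select_document(disagreement, density)
unweighted_choice = max(disagreement, key=disagreement.get)
```

With these numbers, unweighted pool-based sampling would request the label of the outlier, while the density-weighted product prefers the document that is both contested and representative.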

Combining  Active  Learning  and  EM 

Active  learning  can  be  combined  with  EM  by  run¬ 
ning  EM  to  convergence  after  actively  selecting  all  the 
training  data  that  will  be  labeled.  This  can  be  under¬ 
stood  as  using  active  learning  to  select  a  better  start¬ 
ing  point  for  EM  hill  climbing,  instead  of  randomly 
selecting  documents  to  label  for  the  starting  point.  A 
more  interesting  approach,  that  we  term  pool-leveraged 
sampling,  is  to  interleave  EM  with  active  learning,  so 
that  EM  not  only  builds  on  the  results  of  active  learn¬ 
ing,  but  EM  also  informs  active  learning.  To  do  this 
we  run  EM  to  convergence  on  each  committee  mem¬ 
ber  before  performing  the  disagreement  calculations. 
The  intended  effect  is  (1)  to  avoid  requesting  labels 
for  examples  whose  label  can  be  reliably  filled  in  by 
EM,  and  (2)  to  encourage  the  selection  of  examples 
that  will  help  EM  find  a  local  maximum  with  higher 
classification accuracy. With more accurate committee
members, QBC should pick more informative documents
to label. The complete active learning algorithm, both
with and without EM, is summarized in Table 1.
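
The EM inner loop applied to each committee member can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the Laplace-smoothed estimates stand in for Eqs. 3 and 4 (which are not reproduced in this section), and a fixed number of iterations replaces the "loop while parameters change" test.

```python
import math
from collections import defaultdict

def train_nb(docs, weights, classes, vocab):
    """M-step: Laplace-smoothed estimates from (possibly fractional) labels.

    docs is a list of token lists; weights[i][c] is P(c | d_i).
    """
    prior, word_prob = {}, {}
    for c in classes:
        prior[c] = (1.0 + sum(w[c] for w in weights)) / (len(classes) + len(docs))
        counts = defaultdict(float)
        for doc, w in zip(docs, weights):
            for tok in doc:
                counts[tok] += w[c]
        total = sum(counts.values())
        word_prob[c] = {t: (1.0 + counts[t]) / (len(vocab) + total) for t in vocab}
    return prior, word_prob

def posterior(doc, prior, word_prob):
    """E-step for one document: P(c | d) by Bayes rule, computed in log space.

    Every token in doc must be in the training vocabulary.
    """
    logp = {c: math.log(prior[c]) + sum(math.log(word_prob[c][t]) for t in doc)
            for c in prior}
    top = max(logp.values())
    unnorm = {c: math.exp(v - top) for c, v in logp.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def em(labeled, unlabeled, classes, iterations=7):
    """Start from the labeled data, then alternate E- and M-steps."""
    vocab = {t for d, _ in labeled for t in d} | {t for d in unlabeled for t in d}
    lab_docs = [d for d, _ in labeled]
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    prior, word_prob = train_nb(lab_docs, hard, classes, vocab)
    for _ in range(iterations):
        soft = [posterior(d, prior, word_prob) for d in unlabeled]
        prior, word_prob = train_nb(lab_docs + unlabeled, hard + soft, classes, vocab)
    return prior, word_prob

# Hypothetical toy data: one labeled document per class plus an unlabeled pool.
labeled = [(["disk", "drive"], "hw"), (["pixel", "image"], "gfx")]
unlabeled = [["disk", "scsi"], ["scsi", "drive"], ["pixel", "render"], ["render", "image"]]
prior, word_prob = em(labeled, unlabeled, ["hw", "gfx"])
```

In the toy data the words "scsi" and "render" never appear in a labeled document, yet EM associates them with the right class through their co-occurrence with labeled words. For the interleaved variant, the starting parameters would be a sampled committee member rather than the smoothed estimates used here.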

Unlike  settings  in  which  queries  must  be  generated 
[Cohn  1994],  and  previous  work  in  which  the  unlabeled 
data  is  available  as  a  stream  [Dagan  &  Engelson  1995; 
Liere  &  Tadepalli  1997;  Freund  et  al.  1997],  our  as¬ 
sumption  about  the  availability  of  a  pool  of  unlabeled 
data  makes  the  improvements  to  active  learning  pos¬ 
sible.  This  pool  is  present  for  many  real-world  tasks 
in  which  efficient  use  of  labels  is  important,  especially 
in  text  learning. 

5  Related  Work 

A  similar  approach  to  active  learning,  but  without  EM, 
is  that  of  Dagan  and  Engelson  [1995].  They  use  QBC 
stream-based  sampling  and  vote  entropy.  In  contrast, 
we  advocate  density-weighted  pool-based  sampling 
and  the  KL  metric.  Additionally,  we  select  committee 
members  using  the  Dirichlet  distribution  over  classi¬ 
fier  parameters,  instead  of  approximating  this  with  a 
Normal  distribution.  Several  other  studies  have  inves¬ 
tigated  active  learning  for  text  categorization.  Lewis 
and  Gale  examine  uncertainty  sampling  and  relevance 
sampling  in  a  pool-based  setting  [Lewis  &  Gale  1994; 
Lewis  1995].  These  techniques  select  queries  based  on 
only  a  single  classifier  instead  of  a  committee,  and  thus 
cannot  approximate  classification  variance.  Liere  and 
Tadepalli  [1997]  use  committees  of  Winnow  learners 
for  active  text  learning.  They  select  documents  for 
which  two  randomly  selected  committee  members  dis¬ 
agree  on  the  class  label. 

In  previous  work,  we  show  that  EM  with  unlabeled 
data  reduces  text  classification  error  by  one-third 
[Nigam  et  al.  1998].  Two  other  studies  have  used 
EM  to  combine  labeled  and  unlabeled  data  without 
active  learning  for  classification,  but  on  non-text  tasks 
[Miller & Uyar 1997; Shahshahani & Landgrebe 1994].
Ghahramani  and  Jordan  [1994]  use  EM  with  mixture 
models  to  fill  in  missing  feature  values. 

6  Experimental  Results 

This  section  provides  evidence  that  using  a  combina¬ 
tion  of  active  learning  and  EM  performs  better  than 
using  either  individually.  The  results  are  based  on  data 
sets from UseNet and Reuters.²

²These data sets are both available on the Internet.
See http://www.cs.cmu.edu/~textlearning and
http://www.research.att.com/~lewis.


The  Newsgroups  data  set,  collected  by  Ken  Lang,  con¬ 
tains  about  20,000  articles  evenly  divided  among  20 
UseNet  discussion  groups  [Joachims  1997].  We  use 
the five comp.* classes as our data set. When tokenizing
this data, we skip the UseNet headers (thereby
discarding  the  subject  line);  tokens  are  formed  from 
contiguous  alphabetic  characters,  removing  words  on 
a  stoplist  of  common  words.  Best  performance  was 
obtained  with  no  feature  selection,  no  stemming,  and 
by  normalizing  word  counts  by  document  length.  The 
resulting  vocabulary,  after  removing  words  that  occur 
only  once,  has  22958  words.  On  each  trial,  20%  of  the 
documents are randomly selected for placement in the
test  set. 

The ‘ModApte’ train/test split of the Reuters 21578
Distribution  1.0  data  set  consists  of  12902  Reuters 
newswire  articles  in  135  overlapping  topic  categories. 
Following  several  other  studies  [Joachims  1998;  Liere 
&  Tadepalli  1997]  we  build  binary  classifiers  for  each 
of  the  10  most  populous  classes.  We  ignore  words  on 
a  stoplist,  but  do  not  use  stemming.  The  resulting  vo¬ 
cabulary  has  19371  words.  Results  are  reported  on  the 
complete  test  set  as  precision-recall  breakeven  points, 
a  standard  information  retrieval  measure  for  binary 
classification  [Joachims  1998]. 
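
For readers unfamiliar with the measure, one common way to compute a precision-recall breakeven point is sketched below. The function name and the toy scores are assumptions, and other studies interpolate between neighboring precision and recall values instead, so this should not be read as the exact procedure used here.

```python
def breakeven_point(scores, labels):
    """Precision-recall breakeven for a binary task.

    scores[i] is the classifier's score for the positive class on document i;
    labels[i] is True for positive documents. Predicting the top-R documents
    as positive, where R is the number of positive documents, makes precision
    and recall equal by construction.
    """
    num_pos = sum(labels)
    if num_pos == 0:
        raise ValueError("breakeven is undefined without positive documents")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, is_pos in ranked[:num_pos] if is_pos)
    return hits / num_pos

# Hypothetical scores for five test documents, three of them positive.
example = breakeven_point([0.9, 0.8, 0.7, 0.4, 0.2], [True, False, True, True, False])
```

In the example, two of the three top-ranked documents are positive, so precision and recall both equal two thirds.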

In  our  experiments,  an  initial  classifier  was  trained 
with  one  randomly-selected  labeled  document  per 
class.  Active  learning  proceeds  as  described  in  Table  1. 
Newsgroups experiments were run for 200 active learning
iterations, each round selecting one document for
labeling. Reuters experiments were run for 100
iterations, each round selecting five documents for labeling.
Smoothing parameter $\lambda$ is 0.5; sharpness parameter $\beta$
is 3. We made little effort to tune $\beta$ and none to tune
$\lambda$. For QBC we use a committee size of three ($k=3$);
initial  experiments  show  that  committee  size  has  lit¬ 
tle  effect.  All  EM  runs  perform  seven  EM  iterations; 
we  never  found  classification  accuracy  to  improve  be¬ 
yond  the  seventh  iteration.  All  results  presented  are 
averages  of  ten  runs  per  condition. 

The  top  graph  in  Figure  1  shows  a  comparison  of  dif¬ 
ferent  disagreement  metrics  and  selection  strategies 
for QBC without EM. The best combination,
density-weighted pool-based sampling with a KL divergence to
the mean disagreement metric, achieves 51% accuracy
after acquiring only 30 labeled documents. To reach
the same accuracy, unweighted pool-based sampling
with KL disagreement needs 40 labeled documents.
If we switch to stream-based sampling, KL disagreement
needs 51 labelings for 51% accuracy. Our random
selection baseline requires 59 labeled documents.




McCallum  and  Nigam 


[Figure 1 graphs. Horizontal axis: Number of Training Documents, 0 to 200.]

Figure  1:  On  the  top,  a  comparison  of  disagreement  met¬ 
rics  and  selection  strategies  for  QBC  shows  that  density- 
weighted  pool-based  KL  sampling  does  better  than  other 
metrics.  On  the  bottom,  combinations  of  QBC  and  EM 
outperform  stand-alone  QBC  or  EM.  In  these  cases,  QBC 
uses  density-weighted  pool-based  KL  sampling.  Note  that 
the  order  of  the  legend  matches  the  order  of  the  curves  and 
that,  for  resolution,  the  vertical  axes  do  not  range  from  0 
to  100. 


Surprisingly,  stream-based  vote  entropy  does  slightly 
worse  than  random,  needing  61  documents  for  the  51% 
threshold.  Density-weighted  pool-based  sampling  with 
a  KL  metric  is  statistically  significantly  better  than 
each  of  the  other  methods  (p  <  0.005  for  each  pairing). 
It  is  interesting  to  note  that  the  first  several  documents 
selected  by  this  approach  are  usually  FAQs  for  the  var¬ 
ious  newsgroups.  Thus,  using  a  pool  of  unlabeled  data 
can  notably  improve  active  learning. 

In  contrast  to  earlier  work  on  part-of-speech  tagging 
[Dagan  &  Engelson  1995],  vote  entropy  does  not  per¬ 
form  well  on  document  classification.  In  our  experi¬ 
ence,  vote  entropy  tends  to  select  outliers — documents 
that  are  short  or  unusual.  We  conjecture  that  this  oc¬ 
curs  because  short  documents  and  documents  consist¬ 
ing  of  infrequently  occurring  words  are  the  documents 
that  most  easily  have  their  classifications  changed  by 
perturbations  in  the  classifier  parameters.  In  these 
situations,  classification  variance  is  high,  but  the  dif¬ 


ference  in  magnitude  between  the  classification  score 
of  the  winner  and  the  losers  is  small.  For  vote  en¬ 
tropy,  these  are  prime  selection  candidates,  but  KL 
divergence  accounts  for  the  magnitude  of  the  differ¬ 
ences,  and  thus  helps  measure  the  confidence  in  the 
disagreement.  Furthermore,  incorporating  density¬ 
weighting  biases  selection  towards  longer  documents, 
since  these  documents  have  word  distributions  that  are 
more  representative  of  the  corpus,  and  thus  are  consid¬ 
ered  “more  dense.”  It  is  generally  better  to  label  long 
rather  than  short  documents  because,  for  the  same  la¬ 
beling  effort,  a  long  document  provides  information 
about  more  words.  Dagan  and  Engelson’s  domain, 
part-of-speech  tagging,  does  not  have  varying  length 
examples;  document  classification  does. 

Now  consider  the  addition  of  EM  to  the  learning 
scheme.  Our  EM  baseline  post-processes  random  se¬ 
lection  with  runs  of  EM  (Random-then-EM).  The  most 
straightforward  method  of  combining  EM  and  ac¬ 
tive  learning  is  to  run  EM  after  active  learning  com¬ 
pletes  (QBC-then-EM).  We  also  interleave  EM  and 
active  learning,  by  running  EM  on  each  committee 
member  (QBC-with-EM).  This  also  includes  a  post¬ 
processing  run  of  EM.  In  QBC,  documents  are  selected 
by  density-weighted  pool-based  KL,  as  the  previous  ex¬ 
periment  indicated  was  best.  Random  selection  (Ran¬ 
dom)  and  QBC  without  EM  (QBC)  are  repeated  from 
the  previous  experiment  for  comparison. 

The  bottom  graph  of  Figure  1  shows  the  results  of 
combining  EM  and  active  learning.  Starting  with  the 
30  labeling  mark  again,  QBC-then-EM  is  impressive, 
reaching  64%  accuracy.  Interleaved  QBC-with-EM  lags 
only  slightly,  requiring  32  labeled  documents  for  64% 
accuracy.  Random-then-EM  is  the  next  best  performer, 
needing  51  labeled  documents.  QBC,  without  EM, 
takes 118 labeled documents, and our baseline, Random,
takes 179 labeled documents to reach 64% accuracy.
QBC-then-EM and QBC-with-EM are not statistically
significantly different ($p = 0.71$, N.S.); these two
are  each  statistically  significantly  better  than  each  of 
the  other  methods  at  this  threshold  (p  <  0.05). 

These  results  indicate  that  the  combination  of  EM 
and  active  learning  provides  a  large  benefit.  However, 
QBC  interleaved  with  EM  does  not  perform  better 
than  QBC  followed  by  EM — not  what  we  were  expect¬ 
ing.  We  hypothesize  that  while  the  interleaved  method 
tends  to  label  documents  that  EM  cannot  reliably  la¬ 
bel  on  its  own,  these  documents  do  not  provide  the 
most  beneficial  starting  point  for  EM’s  hill-climbing. 
In  ongoing  work  we  are  examining  this  more  closely 
and  investigating  improvements. 

Figure 2: A comparison of random initial labeling and no
initial labeling when documents are selected with
density-weighted pool-based sampling. Note that no initial
labeling tends to dominate the random initial labeling cases.


Another application of the unlabeled pool to guiding
active learning is the selection of the initial labeled
examples. Several previous implementations [Dagan &
Engelson 1995; Lewis & Gale 1994; Lewis 1995] suppose
that the learner is provided with a collection of
labeled  examples  at  the  beginning  of  active  learning. 
However,  obtaining  labels  for  these  initial  examples 
(and  making  sure  we  have  examples  from  each  class) 
can  itself  be  an  expensive  proposition.  Alternatively, 
our  method  can  begin  without  any  labeled  documents, 
sampling  from  the  Dirichlet  distribution  and  select¬ 
ing  with  density-weighted  metrics  as  usual.  Figure  2 
shows  results  from  experiments  that  begin  with  zero 
labeled  documents,  and  use  the  structure  of  the  un¬ 
labeled  data  pool  to  select  initial  labeling  requests. 
Interestingly,  this  approach  is  not  only  more  conve¬ 
nient  for  many  real-world  tasks,  but  also  performs 
better  because,  even  without  any  labeled  documents, 
it  can  still  select  documents  in  dense  regions.  With 
70  labeled  documents,  QBC  initialized  with  one  (ran¬ 
domly  selected)  document  per  class  attains  an  average 
of 59% accuracy, while QBC initialized with none
(relying on density-weighted KL divergence to select all
70) attains an average of 63%. Performance also
increased with EM; QBC-with-EM rises from 69% to 72%
when  active  learning  begins  with  zero  labeled  docu¬ 
ments.  Each  of  these  differences  is  statistically  signif¬ 
icant  (p  <  0.005).  Both  with  and  without  EM,  this 
method  successfully  finds  labeling  requests  to  cover  all 
classes.  As  before,  the  first  requests  tend  to  be  FAQs 
or  similar,  long,  informative  documents. 

In  comparison  to  previous  active  learning  studies 
in  text  classification  domains  [Lewis  &  Gale  1994; 
Liere & Tadepalli 1997], the magnitude of our
classification accuracy increase is relatively modest. Both

Figure 3: Active learning results on three categories of
the Reuters data, corn, trade, and acq, respectively from
the top and in increasing order of frequency. Note that
active learning with committees outperforms random
selection and that the magnitude of improvement is larger
for more infrequent classes.


of  these  previous  studies  consider  binary  classifiers 
with  skewed  distributions  in  which  the  positive  class 
has  a  very  small  prior  probability.  With  a  very  in¬ 
frequent  positive  class,  random  selection  should  per¬ 
form  extremely  poorly  because  nearly  all  documents 
selected  for  labeling  will  be  from  the  negative  class. 
In  tasks  where  the  class  priors  are  more  even,  random 
selection  should  perform  much  better — making  the  im¬ 
provement  of  active  learning  less  dramatic.  With  an 
eye  towards  testing  this  hypothesis,  we  perform  a  sub¬ 
set  of  our  previous  experiments  on  the  Reuters  data 
set,  which  has  these  skewed  priors.  We  compare  Ran¬ 
dom  against  unweighted  pool-based  sampling  (QBC) 
with  the  KL  disagreement  metric. 

Figure 3 shows results for three of the ten binary
classification tasks. The frequencies of the positive classes
are  0.018,  0.038  and  0.184  for  the  corn  (top),  trade 
(middle)  and  acq  (bottom)  graphs,  respectively.  The 
class  frequency  and  active  learning  results  are  repre¬ 
sentative  of  the  spectrum  of  the  ten  classes.  In  all 
cases,  active  learning  classification  is  more  accurate 
than  Random.  After  252  labelings,  improvements  of 
accuracy  over  random  are  from  27%  to  53%  for  corn, 
48%  to  68%  for  trade,  and  85%  to  90%  for  acq.  The 
distinct  trend  across  all  ten  categories  is  that  the  less 
frequently  occurring  positive  classes  show  larger  im¬ 
provements  with  active  learning.  Thus,  we  conclude 
that our earlier accuracy improvements are good, given
that, with unskewed class priors, Random selection
provides a relatively strong performance baseline.

7  Conclusions 

This  paper  demonstrates  that  by  leveraging  a  large 
pool  of  unlabeled  documents  in  two  ways — using  EM 
and  density-weighted  pool-based  sampling — we  can 
strongly  reduce  the  need  for  labeled  examples.  In  fu¬ 
ture  work,  we  will  explore  the  use  of  a  more  direct  ap¬ 
proximation  of  the  expected  reduction  in  classification 
variance  across  the  distribution.  We  will  consider  the 
effect  of  the  poor  probability  estimates  given  by  naive 
Bayes  by  exploring  other  classifiers  that  give  more  re¬ 
alistic  probability  estimates.  We  will  also  further  in¬ 
vestigate  ways  of  interleaving  active  learning  and  EM 
to  achieve  a  more  than  additive  benefit. 
Abstract

When  documents  are  organized  in  a  large 
number  of  topic  categories,  the  categories 
are  often  arranged  in  a  hierarchy.  The  U.S. 
patent  database  and  Yahoo  are  two  examples. 

This  paper  shows  that  the  accuracy  of  a  naive 
Bayes  text  classifier  can  be  significantly  im¬ 
proved  by  taking  advantage  of  a  hierarchy  of 
classes.  We  adopt  an  established  statistical 
technique  called  shrinkage  that  smoothes  pa¬ 
rameter  estimates  of  a  data-sparse  child  with 
its  parent  in  order  to  obtain  more  robust  pa¬ 
rameter  estimates.  The  approach  is  also  em¬ 
ployed  in  deleted  interpolation,  a  technique 
for  smoothing  n-grams  in  language  modeling 
for  speech  recognition. 

Our  method  scales  well  to  large  data  sets, 
with  numerous  categories  in  large  hierarchies. 
Experimental  results  on  three  real-world  data 
sets  from  UseNet,  Yahoo,  and  corporate  web 
pages  show  improved  performance,  with  a  re¬ 
duction  in  error  up  to  29%  over  the  tradi¬ 
tional  flat  classifier. 

1  Introduction 

As  the  dramatic  expansion  of  the  World  Wide  Web 
continues,  and  the  amount  of  on-line  text  grows, 
the  development  of  methods  for  automatically  cate¬ 
gorizing  this  text  becomes  more  important.  A  va¬ 
riety  of  recent  work  has  demonstrated  the  success 
of  statistical  approaches  for  learning  to  classify  text 
documents  [Joachims  1997;  Koller  &  Sahami  1997; 
Yang  &  Pederson  1997;  Nigam  et  al.  1998].  These 
approaches,  such  as  TFIDF  [Salton  1991]  and  naive 
Bayes  [Lewis  &  Ringuette  1994;  McCallum  &  Nigam 


1998],  typically  represent  documents  as  vectors  of 
words,  and  learn  by  gathering  statistics  from  the  ob¬ 
served  frequencies  of  these  words  within  documents 
belonging  to  the  different  classes.  Because  they  rely 
on  these  learned  word  statistics,  these  approaches  are 
data-intensive:  they  often  require  large  numbers  of 
hand-labeled  training  documents  per  class  to  achieve 
high  classification  accuracy. 

This  paper  considers  the  question  of  how  to  scale  up 
these  statistical  learning  algorithms  to  tasks  with  a 
large  number  of  classes  and  sparse  training  data  per 
class.  When  humans  organize  extensive  data  sets  into 
fine-grained  categories,  topic  hierarchies  are  often  em¬ 
ployed  to  make  the  large  collection  of  categories  more 
manageable.  Yahoo,  the  U.S.  patent  database,  MED¬ 
LINE  and  the  Dewey  Decimal  System  are  all  examples 
of  such  hierarchies. 

We  present  a  technique  that  leverages  these 
commonly-available  topic  hierarchies  in  order  to  sig¬ 
nificantly  improve  classification  accuracy,  especially 
when  the  hierarchy  is  large  and  the  training  data  for 
each  class  is  sparse.  We  also  present  a  method  for  ex¬ 
ponentially  reducing  the  amount  of  computation  nec¬ 
essary  for  classification,  while  sacrificing  only  a  small 
amount  of  accuracy. 

Our  approach  applies  a  well-understood  technique 
from  Statistics  called  shrinkage  that  provides  improved 
estimates  of  parameters  that  would  otherwise  be  un¬ 
certain  due  to  limited  amounts  of  training  data  [Stein 
1955;  James  &  Stein  1961].  The  technique  exploits  a 
hierarchy  by  “shrinking”  parameter  estimates  in  data- 
sparse  children  toward  the  estimates  of  the  data-rich 
ancestors  in  ways  that  are  provably  optimal  under  the 
appropriate  conditions.  We  employ  a  simple  form  of 
shrinkage  that  creates  new  parameter  estimates  in  a 
child  by  a  linear  interpolation  of  all  hierarchy  nodes 
from  the  child  to  the  root.  The  interpolation  weights 
are learned by a form of Expectation Maximization
[Dempster,  Laird,  &  Rubin  1977].  This  form  of  shrink¬ 
age  is  also  applied  in  deleted  interpolation,  a  tech¬ 
nique  for  smoothing  n-grams  in  language  modeling  for 
speech  recognition  [Jelinek  &  Mercer  1980]. 
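
The interpolation at the heart of this form of shrinkage can be sketched as follows. The function name, the two-word vocabulary, and the hand-picked weights are assumptions for illustration; in the approach described above the weights are learned by EM rather than fixed by hand.

```python
def shrinkage_estimate(path_estimates, weights):
    """Linearly interpolate word-probability estimates along a leaf-to-root path.

    path_estimates[i][w] is the estimate of P(w) at the i-th node on the path
    (leaf first, root last); weights[i] is that node's interpolation weight.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    vocab = path_estimates[0].keys()
    return {w: sum(lam * est[w] for lam, est in zip(weights, path_estimates))
            for w in vocab}

# Hypothetical estimates for a sparse leaf, its parent, and the root.
leaf = {"believe": 0.00, "other": 1.00}
parent = {"believe": 0.02, "other": 0.98}
root = {"believe": 0.01, "other": 0.99}
shrunk = shrinkage_estimate([leaf, parent, root], [0.5, 0.3, 0.2])
```

The leaf never saw the word "believe", so its own estimate is zero; after interpolation the word receives a small nonzero probability borrowed from its better-supported ancestors, and the result is still a proper distribution.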

Note that our approach to text classification in a
hierarchy is quite different than work by Koller and
Sahami [Koller & Sahami 1997]. Their Pachinko Machine
employs the hierarchy by learning separate classifiers
at each internal node of the tree, and then labeling
a document by using these classifiers to greedily
select sub-branches until it reaches a leaf. Their
approach  is  shown  to  be  helpful  when  documents  are 
represented  using  a  small  subset  (<  100  words)  of 
the  available  vocabulary,  and  a  different  subset  of 
the  vocabulary  is  selected  at  each  node  of  the  hi¬ 
erarchy.  However,  their  approach  did  not  show  im¬ 
provement  with  larger  vocabularies,  and  in  many  do¬ 
mains  (including  the  domains  studied  in  this  paper) 
it  has  been  established  that  large  vocabulary  sizes  of¬ 
ten  perform  best  [Joachims  1997;  Nigam  et  al.  1998; 
McCallum  &  Nigam  1998]. 

Somewhat  surprisingly,  it  can  be  shown  that  a  prob¬ 
abilistic  form  of  Pachinko  Machine,  when  trained  us¬ 
ing  maximum  likelihood  estimates  and  a  constant  vo¬ 
cabulary,  is  equivalent  to  the  simple  non-hierarchical 
classifier  [Mitchell  1998].  At  each  node  in  the  hier¬ 
archy  this  non-deterministic  version  of  the  Pachinko 
Machine  assigns  each  document  probabilistically  to  all 
of  its  descendants,  whereas  the  deterministic  Pachinko 
Machine proposed by Koller and Sahami assigns each
document  to  its  single  most  probable  descendant. 

The remainder of this paper is structured as follows:
we  explain  our  probabilistic  approach  to  text  classifi¬ 
cation,  and  present  the  use  of  shrinkage  in  this  context. 
Then  we  show  experimental  results  on  three  real-world 
data  sets,  present  related  work,  and  close  with  a  dis¬ 
cussion  of  future  work. 

2  Probabilistic  Framework 

We  approach  the  task  of  text  classification  in  a 
Bayesian  learning  framework.  We  assume  that  the 
text  data  was  generated  by  a  parametric  model,  and 
use  training  data  to  calculate  estimates  of  the  model 
parameters.  Then,  equipped  with  these  estimates,  we 
classify  new  test  documents  by  using  Bayes  rule  to 
turn  the  generative  model  around  and  calculate  the 
posterior  probability  that  a  class  would  have  generated 
the  test  document  in  question.  Classification  then  be¬ 
comes  a  simple  matter  of  selecting  the  most  probable 


class  given  the  document’s  words. 

We assume that the data is generated by a mixture
model (parameterized by $\theta$), with a one-to-one
correspondence between mixture model components and
(the observed) classes, $c_j \in \mathcal{C}$. This specifies that
a document, $d_i$, is created by (1) selecting a class, $c_j$,
according to the class priors, $P(c_j|\theta)$, then (2)
having the corresponding mixture component generate a
document according to its own parameters, with
distribution $P(d_i|c_j;\theta)$. The marginal probability of
generating document $d_i$ is thus a sum of total probability
over all mixture components:

$$P(d_i|\theta) = \sum_{j=1}^{|\mathcal{C}|} P(c_j|\theta)\, P(d_i|c_j;\theta). \qquad (1)$$

A document consists of an ordered sequence of
word events, drawn from a vocabulary $V$. We make the
naive Bayes assumption: that the probability of each
word event in a document is independent of the word’s
context given the class, and furthermore independent
of its position in the document. Thus, each document
$d_i$ is drawn from a multinomial distribution with as
many independent trials as the number of words in
$d_i$. We also assume that document lengths, $|d_i|$, are
independent of class. We write $w_{d_{ik}}$ for the word in
position $k$ of document $d_i$, where the subscript of $w$ (in
this case $d_{ik}$) indicates an index into the vocabulary.
Then the probability of a document given its class is:

$$P(d_i|c_j;\theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_{ik}}|c_j;\theta). \quad (2)$$

Given the assumption about one-to-one correspondence
between mixture model components and classes,
the naive Bayes assumption, and the position independence
assumption, the mixture model is composed of
disjoint sets of parameters, $\theta_j$, for each class $c_j$. This
parameter set for each class, $\theta_j$, is composed of
probabilities for each word, $w_t$, such that $\theta_{jt} = P(w_t|c_j;\theta)$
and $\sum_{t=1}^{|V|} \theta_{jt} = 1$. The only other parameters in
the model are the class prior probabilities, written
$\theta_{0j} = P(c_j|\theta)$.
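The two-step generative story above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `generate_document`, the toy vocabulary, and passing the document length in as an argument (since length is assumed independent of class) are all hypothetical choices, not part of the paper.

```python
import random


def generate_document(priors, theta, vocabulary, length, rng):
    """Sample (class_index, words) following steps (1) and (2) of the model.

    priors[j]   plays the role of P(c_j | theta)
    theta[j][t] plays the role of P(w_t | c_j; theta)
    length      is supplied by the caller, because document length is
                assumed to be independent of the class
    """
    # Step (1): select a class according to the class priors.
    class_index = rng.choices(range(len(priors)), weights=priors, k=1)[0]
    # Step (2): draw each word independently from the chosen class's
    # word distribution (naive Bayes and position independence).
    words = rng.choices(vocabulary, weights=theta[class_index], k=length)
    return class_index, words


# Hypothetical two-class, three-word example.
example_rng = random.Random(0)
example_class, example_words = generate_document(
    priors=[0.5, 0.5],
    theta=[[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],
    vocabulary=["gun", "the", "faith"],
    length=6,
    rng=example_rng,
)
```

Classification, described next in the text, inverts this process: it asks which class most probably produced an observed word sequence.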

Given a set of labeled training documents, $\mathcal{D}$, we can
calculate estimates for the parameters of the model
that generated the documents. These estimates consist
of straightforward counting of events, supplemented
by standard Laplace ‘smoothing’ that primes
each estimate with a count of one to avoid probabilities
of zero. We define $N(w_t, d_i)$ to be the
number of times word $w_t$ occurs in document $d_i$, and
define $P(c_j|d_i) \in \{0,1\}$, as given by the document’s
class label. Then, the estimate of the probability of
word $w_t$ in class $c_j$ is

$$\hat\theta_{jt} = P(w_t|c_j;\hat\theta) = \frac{1 + \sum_{i=1}^{|\mathcal{D}|} N(w_t,d_i) \, P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|\mathcal{D}|} N(w_s,d_i) \, P(c_j|d_i)}. \quad (3)$$


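The class prior parameters are set by the maximum
likelihood estimate (with $|\mathcal{D}|$ the number of training
documents):

$$\hat\theta_{0j} = P(c_j|\hat\theta) = \frac{\sum_{i=1}^{|\mathcal{D}|} P(c_j|d_i)}{|\mathcal{D}|}. \quad (4)$$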
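As a rough illustration of Equations 3 and 4, the sketch below computes both sets of estimates from per-document word counts. The function name `estimate_parameters` and the list-of-lists data layout are assumptions made for this example only.

```python
def estimate_parameters(doc_counts, labels, num_classes):
    """Laplace-smoothed word probabilities (Eq. 3) and ML class priors (Eq. 4).

    doc_counts[i][t] is N(w_t, d_i); labels[i] is the class index of d_i,
    so P(c_j | d_i) is 1 when labels[i] == j and 0 otherwise.
    """
    vocab_size = len(doc_counts[0])
    class_word_counts = [[0] * vocab_size for _ in range(num_classes)]
    class_doc_counts = [0] * num_classes
    for counts, label in zip(doc_counts, labels):
        class_doc_counts[label] += 1
        for t, n in enumerate(counts):
            class_word_counts[label][t] += n

    theta = []
    for j in range(num_classes):
        total = sum(class_word_counts[j])
        # Equation 3: add one to every count, and |V| to the denominator.
        theta.append([(1 + n) / (vocab_size + total)
                      for n in class_word_counts[j]])

    # Equation 4: fraction of training documents carrying each label.
    priors = [n / len(labels) for n in class_doc_counts]
    return theta, priors


# Three toy documents over a three-word vocabulary, two classes.
toy_theta, toy_priors = estimate_parameters(
    [[2, 0, 1], [0, 3, 0], [1, 1, 0]], [0, 1, 0], num_classes=2)
```

In the toy data, class 0 has word counts (3, 1, 1), so its smoothed estimates are (4/8, 2/8, 2/8); no word receives probability zero even though the third word never occurs in class 1.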


Given estimates of these parameters calculated from
the training documents, classification can be performed
on test documents by calculating the posterior
probability of each class given the words observed in
the test document, and selecting the class with the
highest probability. We formulate this by first applying
Bayes rule, and then substituting for $P(d_i|c_j;\hat\theta)$
and $P(d_i|\hat\theta)$ using Equations 1 and 2. The length term
$P(|d_i|)$ is the same for every class and cancels:

$$P(c_j|d_i;\hat\theta) = \frac{P(c_j|\hat\theta) \, P(d_i|c_j;\hat\theta)}{P(d_i|\hat\theta)} = \frac{P(c_j|\hat\theta) \prod_{k=1}^{|d_i|} P(w_{d_{ik}}|c_j;\hat\theta)}{\sum_{r=1}^{|C|} P(c_r|\hat\theta) \prod_{k=1}^{|d_i|} P(w_{d_{ik}}|c_r;\hat\theta)}. \quad (5)$$

[Figure 1 diagram:
P_SHRINKAGE(“believe” | alt.atheism) =
    λ^1_alt.atheism × P_MLE(“believe” | alt.atheism)
  + λ^2_alt.atheism × P_MLE(“believe” | Religion)
  + λ^3_alt.atheism × P_MLE(“believe” | Root)
  + λ^4_alt.atheism × P_UNIFORM(“believe”)]

Figure 1: The new, shrinkage-based estimate of the probability
of a word (e.g. “believe”) given a UseNet class (e.g.
alt.atheism) is a weighted sum of the maximum-likelihood
estimates from the leaf to the root, and beyond the root to
the uniform distribution over words.


Despite  the  fact  that  the  mixture  model  and  word 
independence  assumptions  are  strongly  violated  with 
real-world  data,  naive  Bayes  performs  text  classifica¬ 
tion  very  well.  Friedman  and  Domingos  and  Pazzani 
discuss  why  the  violation  of  the  word  independence 
assumption  sometimes  does  little  damage  to  classifi¬ 
cation  accuracy  [Friedman  1997;  Domingos  &  Pazzani 
1997]. 
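The classification rule of this section (Bayes rule applied to the naive Bayes model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the name `classify` is hypothetical, all priors and word probabilities are assumed to be strictly positive, and the computation is done in log space purely to avoid numerical underflow on long documents.

```python
import math


def classify(doc_counts, theta, priors):
    """Return (best_class, posteriors) for one document.

    doc_counts[t] is the number of times word t occurs in the document;
    theta[j][t] and priors[j] are the estimated parameters, assumed > 0.
    """
    log_scores = []
    for class_theta, prior in zip(theta, priors):
        # log P(c_j) + sum over words of N(w_t, d) * log P(w_t | c_j).
        score = math.log(prior)
        for n, p in zip(doc_counts, class_theta):
            if n:
                score += n * math.log(p)
        log_scores.append(score)

    # Normalize over classes. Terms shared by every class, such as
    # P(|d_i|), cancel in this ratio and are therefore omitted.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    posteriors = [w / total for w in weights]
    best_class = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best_class, posteriors


# A document that uses the second vocabulary word twice.
toy_best, toy_posteriors = classify(
    [0, 2, 0],
    theta=[[0.5, 0.25, 0.25], [1 / 6, 4 / 6, 1 / 6]],
    priors=[2 / 3, 1 / 3],
)
```

Here class 1 wins despite its lower prior, because it assigns much higher probability to the word that the document actually contains.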

3  Hierarchical  Classification 

This  section  presents  a  method  of  improving  our  es¬ 
timates  of  the  model  parameters  by  taking  advantage 
of  the  hierarchy.  We  first  briefly  describe  shrinkage 
in  a  general  sense,  then  discuss  its  application  to  text 
classification  in  a  hierarchy,  and  the  mechanics  of  our 
algorithm. 

Background  on  Shrinkage 

We wish to estimate parameters $\theta_1, \ldots, \theta_{|C|}$ (i.e. each
class’s probability distribution over words). The
estimates $\hat\theta_j$ of $\theta_j$ can often be improved by shrinking
each of them towards some common value. See Carlin
and Louis [1996] for a recent summary of shrinkage.
There are two justifications for shrinkage. First, if the
quantities $\theta_1, \ldots, \theta_{|C|}$ are thought to be similar, then
they can be regarded as draws from a common distribution.
In this case, the shrinkage estimator is just the
Bayes estimate. More surprisingly, even if the quantities
are completely unrelated, and even if the data
upon which each estimator is based are independent
of each other, shrinkage estimators still reduce the risk
of the estimators. This is a deep and counterintuitive
fact discovered by Stein [1955] and James and Stein
[1961].

Shrinkage  for  Text  Classification 

We use shrinkage to better estimate the probability $\theta_{jt}$
of word $w_t$ given class $c_j$. For each node in our tree we
construct a maximum likelihood (ML) estimate based
on the data associated with that node (Equation 3
without the Laplace smoothing). An improved estimate
for each leaf node is then derived by “shrinking”
its ML estimate towards the ML estimates of all its
ancestors, namely those estimates found along the path
from that leaf to the root. Figure 1 illustrates this
process. In statistical language modeling terms, we build
a unigram model for each node in the tree, and smooth
each leaf model by linearly interpolating it with all the
models found along the path to the root.

The  estimates  along  a  path  from  the  leaf  to  the  root 
represent  a  tradeoff  between  specificity  and  reliability. 
The  estimate  at  the  leaf  is  the  most  specific  (most 
pertinent,  least  biased),  since  it  is  based  on  data  from 
that  topic  alone.  However  it  is  also  the  least  reliable, 
since  it  is  based  on  the  smallest  sample  of  data.  The 
estimator  at  the  root  is  the  most  reliable,  but  the  least 
specific. 

Since  even  the  root  contains  a  finite  amount  of  data, 
it  may  estimate  some  rare  words  unreliably.  We  there¬ 
fore  extend  the  tree  by  adding,  beyond  the  root,  the 
uniform  estimate.  Thanks  to  the  latter,  we  no  longer 
need  to  smooth  the  individual  ML  estimates  with  the 
Laplacean  prior. 

To  ensure  that  the  ML  estimates  along  a  given  path 
are  independent,  we  subtract  each  child’s  data  from 
its  parent’s  before  calculating  the  parent’s  ML  esti¬ 
mate.  Thus  the  latter  estimate  is  based  on  data  that 
belongs  to  all  the  siblings  of  said  child,  but  not  to  the 
child  itself.  Note  that  in  this  way,  for  any  path  from 
leaf  to  root,  every  datum  in  the  tree  is  used  in  exactly 
one  of  the  ML  estimates,  providing  both  independence 
among  the  estimates  and  efficient  use  of  the  training 
data. 
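The subtraction of each child's data from its parent's can be sketched as below. The sketch assumes, purely for illustration, that every node on the path stores word counts aggregated over its whole subtree; the name `path_ml_estimates` and the fallback to the uniform distribution when a node has no data of its own are likewise assumptions, not details given in the paper.

```python
def path_ml_estimates(path_counts):
    """ML estimates on disjoint data along one leaf-to-root path.

    path_counts is ordered leaf, parent, ..., root; each entry is a
    word-count vector aggregated over that node's entire subtree.
    The returned list has one extra, final entry: the uniform estimate
    that extends the path beyond the root.
    """
    vocab_size = len(path_counts[0])
    uniform = [1.0 / vocab_size] * vocab_size
    estimates = []
    below = [0] * vocab_size  # counts already used by the child on the path
    for counts in path_counts:
        # Remove the child's data so each datum is used exactly once.
        own = [c - b for c, b in zip(counts, below)]
        total = sum(own)
        # Assumption: a node with no remaining data falls back to uniform.
        estimates.append([n / total for n in own] if total else list(uniform))
        below = counts
    estimates.append(uniform)
    return estimates


# Leaf, parent and root counts over a three-word vocabulary.
toy_path = path_ml_estimates([[2, 0, 0], [3, 1, 0], [4, 2, 4]])
```

In the toy path, the parent's estimate is built from (1, 1, 0), the counts contributed by the leaf's siblings, and the root's estimate from (1, 1, 4), the counts outside the parent's subtree.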

Determining  Mixture  Weights 

Given a set of ML estimates along the path from a
leaf to the root (and beyond it, to the uniform estimate),
how do we decide on the weights for interpolating
(mixing) them? Let $\{\hat\theta_j^1, \hat\theta_j^2, \ldots, \hat\theta_j^k\}$ be $k$ such
estimates, where $\hat\theta_j^1 = \hat\theta_j$ is the estimate at the leaf,
and $\hat\theta_j^k$ is the uniform estimate ($\hat\theta_{jt}^k = 1/|V|$ for all
words $w_t$), and $k-2$ is the depth of class $c_j$ in the tree.
The interpolation weights among the ancestors of class
$c_j$ are written $\{\lambda_j^1, \lambda_j^2, \ldots, \lambda_j^k\}$, where $\sum_{i=1}^{k} \lambda_j^i = 1$.

We write $\check\theta_j$ for the new estimate of the class-conditioned
word probabilities based on shrinkage.
The new estimate for the probability of word $w_t$ given
class $c_j$ is

$$\check\theta_{jt} = P(w_t|c_j;\check\theta_j) = \lambda_j^1 \hat\theta_{jt}^1 + \lambda_j^2 \hat\theta_{jt}^2 + \cdots + \lambda_j^k \hat\theta_{jt}^k. \quad (6)$$
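Equation 6 is a weighted average of the distributions on the path, which the following sketch computes directly. The function name `shrinkage_estimate` and the explicit check that the weights sum to one are illustrative choices rather than anything prescribed by the paper.

```python
def shrinkage_estimate(estimates, weights):
    """Linear interpolation of the path estimates (Equation 6).

    estimates[i][t] is the i-th estimate's probability for word t, ordered
    leaf, ..., root, uniform; weights[i] is the matching mixture weight.
    """
    if len(estimates) != len(weights):
        raise ValueError("need exactly one weight per estimate")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to one")
    vocab_size = len(estimates[0])
    return [sum(w * est[t] for w, est in zip(weights, estimates))
            for t in range(vocab_size)]


# Leaf, parent, root and uniform estimates over three words.
toy_shrunk = shrinkage_estimate(
    [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1 / 6, 1 / 6, 4 / 6], [1 / 3] * 3],
    [0.4, 0.3, 0.2, 0.1],
)
```

Because every input is a probability distribution and the weights sum to one, the result is again a distribution. As long as the uniform estimate carries non-zero weight, no word is left with probability zero, which is why the separate Laplace smoothing is no longer needed.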

We derive empirically optimal weights, $\lambda_j^i$, among
the ancestors of $c_j$, by finding the weights that maximize
the likelihood of some hitherto unseen “held-out”
data. We use the fact that the log-likelihood of data
according to the mixture model is a concave function of
the weights (this falls out of Jensen’s inequality), so
that any local maximum is also a global maximum. We find that
maximum for each leaf class, $c_j$, using the following
iterative procedure:


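Initialize: Set the $\lambda_j^i$’s to some initial values, say $\lambda_j^i = 1/k$
(any normalized non-zero initial values will do).

Iterate:

(1) Calculate the degree to which each estimate predicts
the words $w_t$ in the held-out set, $\mathcal{H}_j$, from class $c_j$:

$$\beta_j^i = \sum_{w_t \in \mathcal{H}_j} P(\hat\theta_j^i \text{ was used to generate } w_t) = \sum_{w_t \in \mathcal{H}_j} \frac{\lambda_j^i \hat\theta_{jt}^i}{\sum_{m} \lambda_j^m \hat\theta_{jt}^m} \quad (7)$$

(2) Derive new (and guaranteed improved) weights by
normalizing the $\beta$’s:

$$\lambda_j^i = \frac{\beta_j^i}{\sum_{m} \beta_j^m} \quad (8)$$

Terminate: Upon convergence of the likelihood function
(usually achieved within a dozen or so iterations).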


This  algorithm  can  be  viewed  as  a  particularly  simple 
form  of  EM  [Dempster,  Laird,  &  Rubin  1977],  where 
each  datum  is  assumed  to  have  been  generated  by  first 
choosing  one  of  the  tree  nodes  in  the  path  to  the  root, 
say $\hat\theta_j^i$ (with probability $\lambda_j^i$), then using that estimate
to  generate  that  datum.  EM  then  maximizes  the  total 
likelihood  when  the  choices  of  estimates  made  for  the 
various  data  are  unknown.  The  first  step  in  the  iter¬ 
ative  part  is  thus  the  “E”  step,  and  the  second  one  is 
the  “M”  step. 
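The iterative procedure above can be sketched for a single leaf class as follows. This is an illustrative sketch under assumptions: the name `em_mixture_weights`, the stopping tolerance, and the iteration cap are hypothetical, the held-out data is summarized as one word-count vector, and at least one estimate (such as the uniform one) is assumed to give every held-out word non-zero probability.

```python
import math


def em_mixture_weights(estimates, heldout_counts, max_iterations=200, tol=1e-10):
    """Fit the interpolation weights for one leaf class by EM.

    estimates[i][t]   probability of word t under the i-th path estimate
    heldout_counts[t] number of occurrences of word t in the held-out set
    """
    k = len(estimates)
    weights = [1.0 / k] * k          # Initialize: uniform, normalized, non-zero
    previous_ll = -math.inf
    for _ in range(max_iterations):
        betas = [0.0] * k
        log_likelihood = 0.0
        for t, n in enumerate(heldout_counts):
            if n == 0:
                continue
            mix = sum(w * est[t] for w, est in zip(weights, estimates))
            log_likelihood += n * math.log(mix)
            # E step: expected number of the n occurrences of word t
            # that were generated by each estimate.
            for i in range(k):
                betas[i] += n * weights[i] * estimates[i][t] / mix
        # M step: normalize the betas to obtain the new weights.
        total = sum(betas)
        weights = [b / total for b in betas]
        # Terminate once the held-out log-likelihood stops improving.
        if log_likelihood - previous_ll < tol:
            break
        previous_ll = log_likelihood
    return weights


# Two estimates over a two-word vocabulary; held-out frequencies (0.8, 0.2)
# are matched exactly by weights (0.75, 0.25).
toy_weights = em_mixture_weights([[0.9, 0.1], [0.5, 0.5]], [8, 2])
```

The multiplicative form of the update keeps every weight non-negative and normalized at each iteration, so no explicit constraint handling is needed.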

While conceptually simple, this method makes inefficient
use of the available training data by carving off
some of it to be used as a held-out set. To overcome
this problem, we modify the algorithm as follows: all
the available data is used both to construct the ML
estimates and to optimize the weights. However, as each
document is used in the above algorithm, the ML
estimates are modified to exclude its data, so as to make
them independent of it. This method is very similar to
the “leave-one-out” cross-validation commonly used in
statistical estimation.

# training                                            Mixture Weights
documents   Class                                    child   parent  g’parent  uniform
235         root/politics/talk.politics.guns         0.368   0.092   0.017     0.522
235         root/politics/talk.politics.mideast      0.256   0.132   0.001     0.611
235         root/politics/talk.politics.misc         0.197   0.213   0.026     0.564
235         root/religion/alt.atheism                0.235   0.158   0.022     0.585
235         root/religion/soc.religion.christian     0.181   0.189   0.052     0.578
235         root/religion/talk.religion.misc         0.104   0.255   0.028     0.613
7497        root/politics/talk.politics.guns         0.801   0.089   0.048     0.061
7497        root/politics/talk.politics.mideast      0.859   0.061   0.010     0.071
7497        root/politics/talk.politics.misc         0.762   0.126   0.043     0.068
7497        root/religion/alt.atheism                0.766   0.174   0.043     0.018
7497        root/religion/soc.religion.christian     0.837   0.098   0.041     0.024
7497        root/religion/talk.religion.misc         0.663   0.226   0.049     0.062

Table 1: Mixture weights learned by EM for some nodes in the UseNet class hierarchy described in section 4. Notice that
when training data is sparse (top half of table), classes mix more strongly with their parents than when data is plentiful.
Notice also that more ‘generic’ classes mix more strongly with their parents (e.g. talk.politics.misc’s weight on its parent
is higher than talk.politics.guns’s).

This  technique  of  finding  the  optimal  weights  is  rou¬ 
tinely  used  in  statistical  language  modeling  to  inter¬ 
polate  together  different  models  (such  as  trigram,  bi¬ 
gram,  unigram  and  uniform),  where  it  is  known  as 
“deleted  interpolation”  [Jelinek  &  Mercer  1980].  It 
was  similarly  used  to  interpolate  estimates  from  nodes 
along  a  tree  path  in  [Bahl  et  al.  1989].  This  cross- 
validation  approach  to  setting  the  mixture  weights  is 
not  exactly  the  same  style  of  shrinkage  as  Stein  [1955] 
and  James  and  Stein  [1961],  but  is  similar  in  spirit. 
In  future  work  we  will  compare  the  different  styles  of 
shrinkage. 
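A leave-one-out version of the weight fitting might look like the sketch below. It rests on an assumption that goes beyond what the text states: because the ancestors' estimates on a leaf's path already exclude that leaf's data, only the leaf's own ML estimate is recomputed when a document is held out. The function name `loo_mixture_weights`, the fixed iteration count, and the requirement that the last ancestor estimate be strictly positive (for example the uniform estimate) are also illustrative.

```python
def loo_mixture_weights(leaf_docs, ancestor_estimates, iterations=100):
    """EM for mixture weights using every training document of a leaf class.

    leaf_docs          word-count vectors of the leaf's training documents
    ancestor_estimates fixed distributions for parent, ..., root, uniform;
                       assumed to include one with no zero entries
    Returns weights ordered leaf, parent, ..., uniform.
    """
    vocab_size = len(leaf_docs[0])
    leaf_totals = [sum(doc[t] for doc in leaf_docs) for t in range(vocab_size)]
    k = 1 + len(ancestor_estimates)
    weights = [1.0 / k] * k
    for _ in range(iterations):
        betas = [0.0] * k
        for doc in leaf_docs:
            # Leaf ML estimate with this document's own counts removed.
            held = [total - n for total, n in zip(leaf_totals, doc)]
            held_sum = sum(held)
            leaf_est = ([h / held_sum for h in held] if held_sum
                        else [0.0] * vocab_size)
            estimates = [leaf_est] + list(ancestor_estimates)
            for t, n in enumerate(doc):
                if n == 0:
                    continue
                mix = sum(w * est[t] for w, est in zip(weights, estimates))
                for i in range(k):
                    betas[i] += n * weights[i] * estimates[i][t] / mix
        total = sum(betas)
        weights = [b / total for b in betas]
    return weights


# Three small documents of one leaf class, mixed with a uniform estimate.
toy_loo = loo_mixture_weights(
    [[3, 0, 0], [0, 3, 0], [2, 1, 0]], [[1 / 3, 1 / 3, 1 / 3]])
```

Holding each document out of the leaf estimate prevents the leaf from being credited for words it has only seen in the very document being scored, which would otherwise push all of the weight onto the leaf.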

Table  1  shows  a  subset  of  the  mixture  weights  learned 
by  EM  for  a  hierarchy  based  on  UseNet  articles. 

4  Experimental  Results 

This  section  provides  empirical  evidence  that  shrink¬ 
age  reduces  text  classification  error  by  up  to  29%.  We 
also  show  that  shrinkage  helps  most  when  training 
data  is  sparse  and  the  number  of  classes  is  large.  Fi¬ 
nally,  we  demonstrate  that  dynamically  pruning  the 
tree  can  exponentially  reduce  computation  time,  at 
minimal  loss  of  accuracy.  Experiments  are  based  on 
three  different  real-world  data  sets,  one  consisting  of 
UseNet articles, and two of web pages.^1 All the results
are averages of ten cross-validation trials.

^1 All three data sets are available on-line. See
http://www.cs.cmu.edu/~textlearning.


The  Industry  Sector  hierarchy,  made  available  by  Mar¬ 
ket  Guide  Inc.  (www.marketguide.com),  consists  of 
company  web  pages  classified  in  a  hierarchy  of  indus¬ 
try  sectors.  Using  all  classes  at  depth  two  results  in 
6440  web  pages  partitioned  into  71  classes.  In  tokeniz- 
ing  the  data  we  skip  all  MIME  headers  and  HTML 
tags,  use  a  stoplist,  but  do  not  stem.  After  removing 
tokens  that  occur  only  once,  the  corpus  contains  1.2 
million  words,  with  a  vocabulary  of  size  29964. 

The  Newsgroups  data  set,  collected  by  Ken  Lang,  con¬ 
tains  about  20,000  articles  evenly  divided  among  20 
UseNet  discussion  groups  [Joachims  1997].  Several  of 
the  topic  classes  are  quite  confusable:  five  of  them 
are about computers; three discuss religion. From this
data  set,  we  build  a  two-level  hierarchy  from  the  15 
classes  that  fit  into  the  following  top  level  categories: 
vehicles,  computers,  politics,  religion  and  sports.  We 
tokenize  the  data  in  the  same  way  as  above.  The  re¬ 
sulting  data  set,  after  removing  words  that  occur  only 
once,  contains  1.7  million  words,  and  a  vocabulary  size 
of  52309. 

We gathered the entirety of the Yahoo ‘Science’ hierarchy
in July 1997. The web pages pointed to by Yahoo
are divided into 264 disjoint classes containing 14831
pages as a result of descending to deeper nodes of
Yahoo’s hierarchy until each class contains fewer than 200
documents, and then removing classes with fewer than
20 documents. After tokenizing as above and removing
stopwords and words that occur only once, the corpus
contains 3.0 million words, with a vocabulary size of
76624.

Figure 2: Classification accuracy on the Industry Sector
data set with varying vocabulary size in the horizontal axis.
The tiny vertical bars at each data point indicate standard
error. Performance is best with the full vocabulary, where
shrinkage reduces error by almost one-third.

Feature selection, when used, is performed by selecting
the words that have highest mutual information
with the class variable. A previous study found this
method to be the best for text among several common
methods [Yang & Pedersen 1997]. In addition to
selecting features by the traditional, flat use of mutual
information, we also use the hierarchy for feature
selection. Hierarchical feature selection selects equal
numbers of top words by mutual information at each
internal node of the tree, using the node’s immediate
children as the classes. This corresponds to Koller
and Sahami’s hierarchical feature selection with zero
dependencies [Koller & Sahami 1997], except that we
define the total vocabulary to be the union of all the
vocabularies chosen by the internal nodes. The union
is necessary so that the models we will mix share the
same event space.
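Flat feature selection by mutual information can be sketched as follows. The paper does not spell out the event model behind its mutual information score, so this example assumes a binary "word occurs in the document" indicator; the names `mutual_information` and `select_features` are hypothetical.

```python
import math


def mutual_information(word_present, labels):
    """Mutual information (in nats) between a 0/1 word indicator and the class.

    word_present[i] is 1 if the word occurs in document i, else 0;
    labels[i] is the class of document i.
    """
    n = len(labels)
    joint, class_totals, feature_totals = {}, {}, {}
    for present, label in zip(word_present, labels):
        joint[(label, present)] = joint.get((label, present), 0) + 1
        class_totals[label] = class_totals.get(label, 0) + 1
        feature_totals[present] = feature_totals.get(present, 0) + 1
    # Sum of P(c, f) * log( P(c, f) / (P(c) P(f)) ) over observed pairs;
    # pairs with zero count contribute nothing.
    return sum((count / n) * math.log(count * n / (class_totals[c] * feature_totals[f]))
               for (c, f), count in joint.items())


def select_features(presence, labels, num_features):
    """Indices of the num_features words with highest mutual information.

    presence[t] is the list of 0/1 indicators for word t over all documents.
    """
    ranked = sorted(range(len(presence)),
                    key=lambda t: (-mutual_information(presence[t], labels), t))
    return sorted(ranked[:num_features])


# Word 0 separates the two classes perfectly; word 1 is uninformative.
toy_selected = select_features(
    [[1, 1, 0, 0], [1, 0, 1, 0]], labels=[0, 0, 1, 1], num_features=1)
```

For the hierarchical variant described above, the same selection would be run once per internal node, with that node's immediate children as the labels, and the resulting index sets united into a single vocabulary.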

Hierarchical  classification  improves  accuracy 

Figure  2  shows  classification  accuracy  on  the  Indus¬ 
try  Sector  data  set  with  50-50  train-test  splits  while 
varying  vocabulary  size.  No  partial  credit  is  given  for 
classification  into  neighbors  of  the  true  class. 

Figure 3: Classification accuracy on the Newsgroups data
set with varying amounts of training data. The vertical
axis is zoomed for magnification of the error bars. Overall,
hierarchical modeling provides less improvement than
it does in the Industry Sector data set because the hierarchy
is much smaller. Notice, however, that, as expected,
shrinkage helps more when there is less training data.

First, note that larger vocabulary sizes generally perform
better; this is consistent with previous results of
naive Bayes on several other data sets [Joachims 1997;
Nigam et al. 1998; McCallum & Nigam 1998]. Second,
note that Hierarchical Feature Selection somewhat
improves the performance of flat naive Bayes
in the mid-range of feature selection: at about 5000
words, traditional, flat feature selection obtains 59%
accuracy, while hierarchical feature selection reaches
64%. Third, and most importantly, observe that
shrinkage improves classification accuracy across the
board, making the largest improvement at the full,
unpruned vocabulary size, where it achieves 76% accuracy.
In comparison, the flat classifier reaches its best
performance of 66% at about 10000 words. This difference
represents a 29% reduction in classification error.
We maintain that low-frequency words contribute
significantly to correct classifications, and that shrinkage
helps reduce variance of the estimates in the larger
parameter space that results from the larger vocabulary.^2
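The 29% figure is a relative reduction in error rate, not a difference in accuracy. A short sketch of the arithmetic, using the rounded accuracies quoted above (the function name `relative_error_reduction` is hypothetical):

```python
def relative_error_reduction(baseline_accuracy, improved_accuracy):
    """Fraction of the baseline's errors that the improved classifier removes."""
    baseline_error = 1.0 - baseline_accuracy
    improved_error = 1.0 - improved_accuracy
    return (baseline_error - improved_error) / baseline_error


# Flat naive Bayes at its best (66%) versus shrinkage (76%):
# the error rate falls from 34% to 24%.
reduction = relative_error_reduction(0.66, 0.76)
print(f"{reduction:.1%}")  # prints 29.4%
```

So a gain of ten percentage points in accuracy removes a little under one-third of the flat classifier's errors.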

Shrinkage  helps  more  when  training  data  is 
sparse. 

Figure  3  shows  accuracy  on  the  Newsgroups  data  set 
with  the  full  vocabulary,  while  varying  amount  of 
training  data.  Our  experiments  indicate  that  accuracy 
in  this  domain  is  highest  with  no  feature  selection,  (i.e. 
using  the  full  vocabulary),  for  both  flat  and  hierar¬ 
chical  classifiers,  even  with  small  amounts  of  training 
data. 

It is interesting to see that hierarchical modeling provides
less improvement on this data set than it does
in the Industry Sector corpus. We expect that this is
due to the significantly reduced branch-out factor in
this smaller hierarchy. Unlike the Industry Sector
hierarchy, in which the mean number of siblings is six,
here the mean number of siblings is three. Thus each
child has fewer siblings and less data from which to
“borrow strength.”

^2 Large vocabularies need not be a computational concern.
In our experiments, with the largest vocabulary,
it takes only 216 seconds to classify 3220 Industry Sector
documents and write the results to disk. In comparison,
the smallest vocabulary takes 208 seconds, a difference of
0.002 seconds per document on average.

The second expected result, exhibited in Figure 3, is
that shrinkage provides more improvement when the
amount of training data is small, and that shrinkage
reduces variance in the classifications (notice the larger
error bars on the ‘flat classification’ curve). If each
class had an infinite amount of training data, accurate
parameter estimates could be obtained for each class
independently; however, when training data is sparse,
estimates are improved by using shrinkage to smooth
a class’s parameters with its ancestors.

The two findings that (1) shrinkage allows the use of
helpful large vocabulary sizes, and (2) shrinkage improves
performance more when training data is sparse,
are both confirmed by our experiments with the Yahoo
data set. Figure 4 shows classification accuracy
on the Science hierarchy as a function of vocabulary
size, again with no partial credit for near misses. Flat
naive Bayes reaches its highest accuracy of 36.4% at
a relatively small vocabulary size of 1449. Hierarchical
classification always performs better than flat, but
attains its best accuracy of 39.5% at a larger vocabulary
size of 13311. The improvement in accuracy is
not as dramatic here as with the Industry Sector data
set, perhaps because the Yahoo set is noisier (being
gathered automatically rather than by hand, and
containing many documents that are simply timeout
messages or pointers to moved pages), and because
Yahoo has many classes with overlapping or closely
neighboring definitions.^3 However, it is interesting to
note that among those classes with small quantities of
training data, shrinkage improves performance more
strongly. Among those 151 classes with 50 documents
or fewer, shrinkage improves accuracy by 8 percentage
points, from 29% to 37%. Among those 50 classes
containing more than 100 documents, shrinkage does not
improve accuracy, with both classifiers obtaining about 45%.

This result indicates that shrinkage would be all the
more important if we attempted to classify documents
into Yahoo’s deepest leaf categories instead of into the
somewhat coalesced and pruned version that is used
here and is defined at the beginning of this section.
However, this would result in thousands of classes,
quite a computational burden. Next we describe how
the hierarchy itself can be used to ease this burden.

^3 Using more complex Bayesian classifiers that capture
more dependencies than naive Bayes may help this last
problem. The larger number of parameters in these models
will make training data even more sparse, and this suggests
that the use of shrinkage would be all the more important.

Figure 4: Classification accuracy on the Yahoo Science data
set with varying vocabulary sizes. Tiny vertical bars at
each data point indicate standard error. The large number
of classes and noisy data make this task difficult.

Pruning  the  tree  for  increased  computational 
efficiency 

In  addition  to  improving  accuracy,  the  class  hierarchy 
can  also  be  leveraged  to  improve  computational  effi¬ 
ciency.  The  classifier  can  avoid  calculating  P(cj|di) 
for  a  majority  of  the  classes  (leaves  of  the  tree)  by 
pruning  the  tree  dynamically  during  the  classification 
of  each  document.  Like  the  Pachinko  Machine  [Koller 
&  Sahami  1997]  we  can  classify  the  document  at  in¬ 
ternal  nodes  of  the  tree,  and  choose  only  to  calculate 
probabilities  for  classes  underneath  the  branches  se¬ 
lected  by  these  higher-level,  coarse-grained  classifiers. 
Note,  however,  that  when  we  do  this,  each  “pruning 
classification”  at  the  interior  of  the  tree  is  an  opportu¬ 
nity  for  error,  and  the  deeper  the  hierarchy  the  more 
the  opportunities  for  error  will  compound. 

As expected, our experimental results show that performing
this pruning does indeed reduce classification
accuracy. However, one may be willing to accept this
reduction in exchange for the exponential reduction in
the amount of computation necessary for classification.
On the Industry Sector data set, averaged over ten runs,
pruning that removes from consideration all but a single
branch at each interior node reaches 70.0% accuracy,
more than 5 percentage points lower than without pruning.
However, unlike the Pachinko Machine, our paradigm
allows for the comparison of classification scores from
leaves that do not share the same parent. Thus we
can also prune less aggressively. Pruning that keeps
two branches attains 74.3%, and pruning to three
branches achieves 75.2%. This last result is only about
half a percentage point less than the 75.8% obtained by
the full evaluation of the tree without pruning. The same
approach could also be used for Yahoo.
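The pruning scheme can be sketched as a search that keeps only the best few branches at each interior node and then compares the surviving leaves directly. Everything concrete here is hypothetical: the function name `pruned_classify`, the dictionary representation of the tree, and the use of a precomputed score table standing in for each node's classifier (a real system would compute scores only for the nodes it visits).

```python
def pruned_classify(tree, node_scores, root, branches_kept):
    """Classify with dynamic pruning of the class hierarchy.

    tree        maps each interior node to its list of children;
                leaves do not appear as keys
    node_scores maps every non-root node to the document's score under
                that node's model (higher is better)
    Returns (best_leaf, number_of_nodes_scored).
    """
    frontier = [root]
    surviving_leaves = []
    nodes_scored = 0
    while frontier:
        node = frontier.pop()
        children = tree.get(node)
        if not children:
            surviving_leaves.append(node)
            continue
        nodes_scored += len(children)
        ranked = sorted(children, key=lambda c: node_scores[c], reverse=True)
        # Keep only the most promising branches below this node.
        frontier.extend(ranked[:branches_kept])
    # Leaves with different parents are compared on the same scale.
    best_leaf = max(surviving_leaves, key=lambda leaf: node_scores[leaf])
    return best_leaf, nodes_scored


toy_tree = {
    "root": ["politics", "religion"],
    "politics": ["talk.politics.guns", "talk.politics.mideast"],
    "religion": ["alt.atheism", "soc.religion.christian"],
}
toy_scores = {
    "politics": -10.0, "religion": -9.0,
    "talk.politics.guns": -5.0, "talk.politics.mideast": -7.0,
    "alt.atheism": -6.0, "soc.religion.christian": -8.0,
}
greedy = pruned_classify(toy_tree, toy_scores, "root", branches_kept=1)
wider = pruned_classify(toy_tree, toy_scores, "root", branches_kept=2)
```

In this toy example the single-branch search commits to the slightly better-scoring interior node and returns alt.atheism after scoring four nodes, while keeping two branches scores six nodes and recovers the best leaf, talk.politics.guns. This mirrors the trade-off reported above between computation saved and accuracy lost.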

5  Related  Work 

Shrinkage  estimation  is  now  considered  standard 
methodology  in  Statistics.  It  is  used  routinely  in  a  vast 
array  of  problems  and  its  theoretical  properties  have 
been  studied  from  both  the  Bayesian  and  frequentist 
points  of  view.  A  good  discussion  with  ample  refer¬ 
ences  and  examples  is  contained  in  [Carlin  &  Louis 
1996].  Although  MacKay  and  Peto  [1994]  do  not  use 
the  term  “shrinkage”  in  their  paper,  they  apply  this 
Bayesian style of shrinkage in their hierarchical Dirichlet
model for n-grams.

Shrinkage  in  the  cross-validation  style  was  first  used 
to  derive  a  language  model  in  [Jelinek  &  Mercer  1980], 
where  it  is  known  as  deleted  interpolation.  Interpola¬ 
tion  of  language  models  along  the  path  of  a  tree  is  de¬ 
scribed  in  [Bahl  et  al.  1989].  More  recently,  Seymore 
and  Rosenfeld  [1997]  classified  a  speech  recognizer’s 
output  into  multiple  topics,  then  used  an  automati¬ 
cally  derived  “topic  tree”  to  interpolate  the  models 
associated  with  appropriate  nodes  up  that  tree. 

A  variety  of  work  in  the  Information  Retrieval  and 
Machine  Learning  communities  has  demonstrated  the 
success  of  statistical  approaches  for  learning  to  classify 
text  documents.  Naive  Bayes  has  been  used  for  text 
classification,  and  due  to  its  probabilistic  foundations, 
been  applied  in  several  extensions  [Lewis  &  Ringuette 
1994;  Joachims  1997;  Nigam  et  al.  1998]. 

An earlier approach to hierarchical document classification,
the Pachinko Machine, has been proposed
by Koller and Sahami [1997]. Their method differs
significantly from shrinkage. The Pachinko Machine
classifies documents at internal nodes of the tree, and
greedily selects sub-branches until it reaches a leaf.
Since classification errors at internal nodes compound,
the accuracy at all the internal nodes must be very
high in order for overall accuracy to be higher than
that of a flat classifier (especially for deeper hierarchies). We
experimented with schemes that allow a lower node
to “reject” a document and send it back up the tree
for re-classification, but did not find these to work
well. Koller and Sahami present results with small
vocabularies (fewer than 100 words); however, other
results in the literature indicate that large vocabulary
sizes often have higher accuracy [Joachims 1997;
Nigam et al. 1998]. A possible explanation for the
discrepancy is that Koller and Sahami use a multivariate
Bernoulli model while we use a multinomial
model [Sahami, Personal Communication]. In our
experiments we have found multinomials to outperform
Bernoullis [McCallum & Nigam 1998]. Our use of
shrinkage has allowed us to more robustly keep large
vocabulary sizes, which we believe are necessary for
classifying large data sets with large numbers of
diverse classes.

Another  learning  method  that  uses  EM  to  set  mixture 
weights  among  ancestors  in  a  hierarchy  is  Adaptive 
Mixtures  of  Probabilistic  Transducers  [Singer  1997]. 
Each  node  in  a  hierarchy  that  represents  a  history- 
window  is  linearly  mixed  with  its  parent,  which  in 
turn,  is  mixed  with  its  parent.  The  model  is  applied 
with  success  to  noun  phrase  recognition. 

Hofmann and Puzicha’s [1998] Hierarchical Asymmetric
Clustering Model (HACM) performs unsupervised
clustering  with  a  mixture  model  in  which  EM  is  also 
used  to  set  weights  among  the  ancestors  in  a  hierarchy. 

6  Conclusions 

This  paper  has  examined  the  use  of  class  hierarchies  for 
improving  text  classification.  As  the  amount  of  on-line 
text  increases  and  the  number  of  topic  categories  into 
which  it  is  organized  grows,  hierarchies  are  becoming 
an  increasingly  prevalent  way  to  make  a  collection  of 
categories  manageable.  Thus,  the  need  for  good  text 
classification  algorithms  that  take  advantage  of  these 
hierarchies  becomes  more  important. 

In  this  paper  we  demonstrate  that  shrinkage  with  a 
class  hierarchy  improves  parameter  estimation,  and 
can  reduce  text  classification  error  by  up  to  29%.  Be¬ 
cause  shrinkage  helps  especially  when  there  is  sparse 
training data, shrinkage should be all the more beneficial
as we scale up to larger, higher-resolution, deeper
hierarchies  with  more  classes  that  require  larger  vo¬ 
cabularies. 

We  also  show  that  a  class  hierarchy  can  be  used  to 
exponentially  reduce  the  amount  of  computation  re¬ 
quired  to  classify  documents,  and  that  we  can  do  so 
without  sacrificing  significant  classification  accuracy. 

In future work, we will investigate the use of shrinkage
to learn more complex Bayesian models with less
restrictive assumptions than naive Bayes. The improvements
due to shrinkage should be increasingly strong
as we move to models that have more parameters, and
thus sparser training data. We will also explore
alternative methods of shrinkage, including the Bayesian
methods in the style of James and Stein. We plan to
work with a related approach that uses EM to cluster
the data in a parent, and then allows the child to
mix with the different clusters independently. In other
ongoing work we are studying the advantages of using
EM not only to set the mixture weights, but also to
redistribute individual words of training data among the
nodes on the path from the leaf to the root.

Lastly, we plan to explore ways to learn the class hierarchy,
investigating methods that specifically aim to increase classification
accuracy. In early experiments, it appears that when the learner is not
explicitly given a hierarchy, even using the “trivial” hierarchy (each
class being a leaf off the root) does better than the flat classifier,
though not as well as when we are given a “non-trivial” hierarchy.
Furthermore, using a “bad” or scrambled hierarchy also does better than
the flat classifier: the mixture weights are set by EM to mimic the
trivial hierarchy.
Abstract

Research emanating from Artificial Intelligence has throughout its
history contributed to techniques and ideas in Software Engineering. We
describe in this paper a case study showing the application of theory
revision to the refinement of a formally specified requirements model.
In a previous project we were contracted to create a precise model of
the complex criteria governing the separation of aircraft profiles in
Atlantic Airspace. During that work it became clear that the
(automated) validation of the model was of the utmost importance, and
in our current project we have used machine learning tools to provide
extra support in bug identification, bug removal and maintenance of
such a requirements model. In this paper we give an overview of the
domain, identify a relevant learning bias which makes search for
revisions tractable, and describe a systematic approach for the
application of theory revision to such a model. We illustrate the
approach with results of experiments where theory revision techniques
have identified and removed errors, and induced a new part of the
model.

Keywords  Theory Revision, Machine Learning and Software Engineering,
Requirements Model, Automated Validation.

1  INTRODUCTION 

Promoting and maintaining the quality of requirements specifications
has a vital role in the engineering of software. Some software
projects, such as those involving safety-critical elements, necessitate
that precise, mathematical specifications of their requirements domains
be constructed. Such ‘requirements models’ must be validated to satisfy
certain major quality objectives such as accuracy, completeness,
usability, and understandability. During its lifetime a model is likely
to be incrementally updated, and will then require re-validation.
Validation and maintenance of realistic domain models is a very
time-consuming, expensive process in which the role of support tools is
vital. The process is best carried out using diverse techniques, and
one of the most useful techniques is to test an animated form of the
model. Even when an animated version is available, however, it is not
easy to pinpoint the causes of bugs and subsequently provide the
correct revision that eliminates them.

In this work we view a precise requirements model as an imperfect
theory of the requirements domain that needs to undergo refinement to
remove bugs or to reflect changes in the domain, and we formulate the
problem as one of theory revision. The case study uses an air traffic
control requirements model developed in a previous project called
FAROAS (McCluskey et al. 1995). The model represents aircraft
separation criteria and conflict prediction procedures relating to
airspace over the North East Atlantic, and is recorded in the ‘Formal
Methods Europe Applications Database’. The model’s ‘conventional’
support environment had been used for verification and validation of
models written as a set of axioms in many-sorted first-order logic
(Meinke and Tucker 1993), here abbreviated to msl. During the current
IMPRESS project we extended the environment to include machine learning
tools which perform blame assignment, explanation-based generalisation
and theory revision (TR). We show in this paper how we overcame the
intractability problems in fielding TR by first focusing on likely
faulty axiom sets using a blame assignment algorithm, and then
targeting for revision the ordering relations between values of ordinal
sorts. We describe a method and a class of revision operator that have
been successfully used (a) to find and remove bugs from the
requirements model, and (b) to construct a new part of the model to
cope with a change in the criteria for vertical separation between
subsonic aircraft. Thus TR can be seen as a useful embedded component
within a requirements validation regime for high integrity systems.

2  THE  ATC  DOMAIN 

2.1  DOMAIN  DESCRIPTION  AND 
ACQUISITION 

‘Shanwick’ is a large area of airspace in the eastern half of the North
Atlantic, managed by air traffic control centres in Shannon, Ireland
and Prestwick, Scotland. Controllers must organise this airspace daily,
taking into account such factors as weather and the desired flight
paths of aircraft companies. They plan the four dimensional flight
profiles of aircraft crossing this airspace in good time before the
aircraft reaches the boundary, and for this task require a precise
definition of aircraft separation criteria, and an algorithm for
predicting conflicts. The controllers are supported in their
safety-critical work by a computer system which performs prediction and
resolution of conflicts between pairs of flight profiles, and our
involvement came about as part of the research and development
concerning the requirements specification of a replacement for their
current flight data processing system.

In the FAROAS project, we created a precise requirements model (called
the CPS) of the conflict prediction of aircraft flight profiles through
the Shanwick airspace, together with a software support environment.
Knowledge sources used were manuals of air traffic control, existing
computer systems documentation, and air traffic control officers
themselves. The current CPS contains a kernel of 300 - 400 axioms in
msl representing aircraft profile separation criteria and a conflict
prediction method; the total number of axioms in an instance of the
model, which includes airspace and short term flight information for a
day’s set of profiles, exceeds two thousand. The model is structured
into 23 sorts, and is enriched with real and natural numbers. An
example of an axiom in the CPS is provided in Figure 1. This represents
the condition for a vertical separation of 2,000 feet, where segments
are roughly ‘straight’ components of an aircraft’s profile. The
condition holds either when both aircraft are subsonic and one or both
are flying above FL 290 (29,000 feet), or when one or both are
supersonic and one or both are flying at or below FL 430.

    (Segment1 and Segment2
        are_subject_to_oceanic_cpr) =>
    [(the_min_vertical_sep_Val_in_feet_required_for
        Flight_level1 of Segment1
        and Flight_level2 of Segment2) = 2000 <=>
     [[(both Segment1 and Segment2
          are_flown_at_subsonic_speed)
       & (one_or_both_of Flight_level1 and
          Flight_level2 are_above FL 290) ] or
      [(one_or_both_of Segment1 and Segment2
          are_flown_at_supersonic_speed) &
       (one_or_both_of Flight_level1 and
          Flight_level2 are_at_or_below FL 430) ] ] ]

Figure 1: Condition for a Minimum Vertical Separation of 2000 feet
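The right-hand side of the Figure 1 equivalence can be restated as a small executable check. The Python below is only an illustrative sketch: it assumes flight levels are plain integers (FL 290 is written 290) and that the speed regime of each segment is given as a boolean; the function and parameter names are hypothetical and are not part of the CPS.

```python
def requires_2000ft_vertical_sep(fl1: int, fl2: int,
                                 supersonic1: bool, supersonic2: bool) -> bool:
    """Illustrative restatement of the condition in Figure 1.

    fl1, fl2: flight levels of the two segments (e.g. 290 for FL 290).
    supersonic1, supersonic2: True if the segment is flown at supersonic speed.
    """
    both_subsonic = not supersonic1 and not supersonic2
    any_supersonic = supersonic1 or supersonic2
    any_above_fl290 = fl1 > 290 or fl2 > 290
    any_at_or_below_fl430 = fl1 <= 430 or fl2 <= 430
    return ((both_subsonic and any_above_fl290)
            or (any_supersonic and any_at_or_below_fl430))
```

For instance, two subsonic segments at FL 300 and FL 280 satisfy the first disjunct, whereas two subsonic segments at FL 250 and FL 280 satisfy neither.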

2.2  A  CONVENTIONAL  SUPPORT 
ENVIRONMENT 

The CPS is highly structured, with axioms containing very complex
conditions, but the support of an integrated tools environment eases
its analysis and manipulation. In the FAROAS project diverse validation
was carried out using tight syntactic checking, semantic internal
consistency checks, expert inspection, simulation and batch testing.
The most complex tool in the environment is a translator program which
inputs the CPS (or more generally a set of wffs in msl), together with
a syntactic definition of the tailored msl language expressed in
grammar rules. It parses the wffs and outputs an animation of them by
translating them into what we call ‘EF’ (execution form). This is
similar to general clausal form, except clauses may contain nested
negation and disjunction in their bodies. EF obeys the syntax rules of
Prolog and is executable by a Prolog interpreter. This parsing and
translation process takes less than 5 minutes for all of the CPS, and
its translated form we term CPSef.

Flight profiles are input to the software environment as msl axioms and
are translated into EF. Although in theory any part of the CPS can be
tested, virtually all of the instances we obtained were for the ‘top
level’ conflict axiom defining the mixfix conflict predicate:

    SegmentX of Profile1 and SegmentY of Profile2
    are_in_oceanic_conflict

Footnote: All software tools reported in this paper are implemented in
SICStus Prolog and were tested using a Sun SPARCstation 4 processor
with 32MB memory.

A day’s worth (500 - 800) of cleared aircraft profiles, where each
profile is cleared with (say) the last 20 cleared aircraft in
chronological order, results in approximately 10,000 instances
classified as false for the conflict axiom, where SegmentX and SegmentY
are existentially quantified variables representing segments of an
aircraft’s profile.

In the rest of this paper we use the following notation: a classified
instance that is labelled true and which CPSef classifies as true is
called truly positive (‘TP’) and denoted $e^{TP}$, whereas one that
executes to false is called falsely negative (‘FN’) and denoted
$e^{FN}$. A classified instance labelled false which executes to false
using CPSef is called truly negative (‘TN’), whereas one that executes
to true is called falsely positive (‘FP’); these are denoted $e^{TN}$
and $e^{FP}$ respectively. Early phases of validation during the FAROAS
project involving syntax checking and painstaking expert inspection
increased the accuracy and completeness of the CPS so that dynamic
testing of the conflict axiom resulted in a large number of TN, with a
smaller (about 5 per cent) but significant number of FP. Although
investigation of the set FP helped to find bugs, it became clear that
more powerful tools for bug identification and removal were needed when
building up and maintaining such a complex, precise domain model.
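The four-way labelling just introduced can be written down directly. The following Python sketch assumes each classified instance is reduced to a pair of booleans (the expert label, and the result of executing the instance against CPSef); the names `classify` and `tally` are hypothetical helpers, not tools from the project.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify(label: bool, executed: bool) -> str:
    """Return 'TP', 'FN', 'TN' or 'FP' for one classified instance.

    label: the classification the instance was given (True = conflict).
    executed: the truth value obtained by executing the theory.
    """
    if label:
        return "TP" if executed else "FN"
    return "FP" if executed else "TN"


def tally(instances: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many instances fall into each of the four classes."""
    return Counter(classify(label, executed) for label, executed in instances)
```

With cleared flight profiles every label is false, so only the TN and FP counts can be non-zero, which matches the testing results described above.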

3  APPLICATION  OF  THEORY 
REVISION 

3.1  RATIONALE 

The principal objectives of the current project, IMPRESS, were to test
the use of ML to help improve the quality (in terms of accuracy and
completeness) of a formalised requirement specification written in msl
and to increase the quality of the CPS itself. The focus was not only
on bug removal but also on maintenance, to support the inevitable
changes in the requirements model. Since we started with an existing
symbolic domain model, the principal ML paradigm we decided to use was
theory revision (Wrobel 1996). Our initial formulation was as follows:

Revisable theory: a subset of CPSef clauses. We can keep some parts of
the CPSef immune or ‘shielded’ from the revision process, as they were
adequately validated using other processes. For example, it may be
assumed that the ‘top level’ axioms, i.e. those defining the basics of
separation in terms of vertical and horizontal dimensions, are correct.
The target concept is the conflict predicate shown above.

Training Instances: The main source is a day’s worth of cleared flight
profiles supplied directly by the UK National Air Traffic Services. The
conflict predicate can be executed, and when instantiated with pairs of
cleared flight profiles should return false. The nature of the
application skews the training somewhat as it is driven by FPs only.
However, experiments have also been conducted with other, lower-level
predicates as target concepts, such as those involved in vertical
conflict. Instances associated with these conflicts are classified into
FNs and TPs as well as TNs and FPs.

Learning Biases: the language used for the CPSef is strongly typed,
which provides a useful constraint in the generalising or specialising
of predicates. We also assume a minimal revision bias: we know from
other forms of validation that the structure of the CPSef mirrors the
requirements domain, and so we assume only minimal revisions are
necessary.

Given the general problem outlined above, we implemented a standard,
simple TR algorithm with operators such as ‘add antecedent’ and ‘delete
clause’. However, this only confirmed that a ‘mainstream’ approach to
TR would be impracticable. Even given the biases, the potential space
of revisions is enormous, and ‘hill-climbing’ with traditional TR
operators appears out of the question. The CPSef executes the conflict
axiom at an average rate of about one test per minute, so a batch of
tests may take days to execute!
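The cost argument can be made concrete with a short calculation. The Python sketch below uses the two figures quoted in the text (about one test per minute, and approximately 10,000 instances for a day's cleared profiles); the function name and the idea of multiplying by a number of candidate revisions are illustrative assumptions only.

```python
MINUTES_PER_DAY = 60 * 24


def batch_days(num_tests: int, tests_per_minute: float = 1.0) -> float:
    """Wall-clock days needed to execute one batch of tests."""
    return num_tests / tests_per_minute / MINUTES_PER_DAY


# One full batch of roughly 10,000 instances at one test per minute.
one_batch = batch_days(10_000)

# Hill-climbing re-runs the batch for every candidate revision it scores;
# the figure of 50 candidates is purely hypothetical.
fifty_candidates = 50 * one_batch
```

Under these assumptions a single batch already takes close to a week, so scoring every candidate revision against the whole batch is not practical.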

We also investigated using TR tools, available via ftp, but came to the
conclusion that we would need to build our own environment (West et al.
1996). This was based on the need for a flexible tool base given we
were embarking on a research project, and the need for tool
integration, particularly with our existing validation tools from the
earlier FAROAS project. Moreover, the existing tools we examined were
not powerful enough for our use. For example, FORTE (Richards and
Mooney 1995), though well tested, could not cope with negation or
functors, both of which are important features of the CPS. Also, while
tools presented in the literature had been tested on theories of the
order of tens of predicate calls within a similar number of non-atomic
clauses, the CPSef contains c. 2,000 predicate calls within more than
300 non-atomic clauses.
3.2  ORDINAL  SORTS

The key to our approach lay in the introduction of a further bias.
Although the sorts comprising universes of objects are distinct, each
sort can be characterised as either ordered or not (Birkhoff 1967). The
sorts which are ordered are termed ordinal in this paper, and those
which are not are termed nominal. Associated with each ordinal sort X
is an arbitrary binary, transitive, ordering relationship which we
denote $\succ$. Examples of ordinal sorts are Flight Level, Time and
Latitude, where primitive order relations are for example ‘is above’,
‘is later than’, ‘is west of’. Examples of nominal sorts are Aircraft,
Airspace, Segment, Profile. Technical specifications such as the CPS
include many references to ordinal sorts, and our experience in the
validation phase had shown that very often clauses involving
comparisons and limits were to blame. For example, the 17 primitive
order relations defined in the CPS’s grammar occur 204 times within the
current version of the CPS.
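The ordinal/nominal distinction amounts to a small lookup table. The Python sketch below is an assumption-laden illustration: the sort names come from the examples in the text, while the relation identifiers (`is_above`, `is_later_than`, `is_west_of`), the triple encoding of literals and the helper names are hypothetical.

```python
from typing import Iterable, Tuple

# Ordinal sorts and one primitive order relation for each (examples only).
PRIMITIVE_ORDER_RELATIONS = {
    "FlightLevel": "is_above",
    "Time": "is_later_than",
    "Latitude": "is_west_of",
}

# Nominal sorts carry no ordering.
NOMINAL_SORTS = {"Aircraft", "Airspace", "Segment", "Profile"}

Literal = Tuple[str, str, str]  # (left argument, relation, right argument)


def is_ordinal(sort: str) -> bool:
    """A sort is ordinal if an order relation is registered for it."""
    return sort in PRIMITIVE_ORDER_RELATIONS


def count_order_literals(body: Iterable[Literal]) -> int:
    """Count the literals in a clause body that use a primitive order relation."""
    relations = set(PRIMITIVE_ORDER_RELATIONS.values())
    return sum(1 for (_, relation, _) in body if relation in relations)
```

A count of this kind, taken over every clause, is the sort of statistic the text reports when it notes how often order relations occur in the CPS.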

Each axiom in the CPS has as its variable domain

$X_1 \times \cdots \times X_n \times D_1 \times \cdots \times D_m$,  $n, m > 0$,

where each $X_i$ is an ordinal sort and each $D_j$ is a nominal sort.
We will focus on axioms containing order relations; examples are
$x \succ a$ and $x_1 \succ x_2$, where $x, x_1, x_2$ are ordinal
variables and $a$ is a constant, limiting value of some appropriate
sort. The axiom involving ordering might be an equation defining a
function, $\mathcal{F}$, which returns different values for different
subsets of its domain
$X_1 \times \cdots \times X_n \times D_1 \times \cdots \times D_m$, or
a predicate, $\mathcal{P}$. The statements involved in the definition
of $\mathcal{P}$ return ‘true’ or ‘false’ for different subsets of its
domain
$X_1 \times \cdots \times X_n \times D_1 \times \cdots \times D_m$. If
we factor out the $X_i$ from the $D_j$ components, then for each main
predicate and function, and for each tuple $(d_1, \ldots, d_m)$ of
values, there is defined an $n$ dimensional region
$\mathcal{R}(d_1, \ldots, d_m)$: the domain of applicability of the
predicate or function. For the remainder of the paper, we shall shorten
this to $\mathcal{R}$.

When  the  CPS  is  translated  to  executable  form,  each 
axiom  becomes  a  Prolog  clause.  The  regions  described 
above,  for  the  main  axioms,  now  become  regions  for 
Prolog  clauses,  where  tuples  of  variables  now  become 
tuples  of  Prolog  variables.  In  the  case  where  a  wff  is 
an  equation,  its  domain  is  extended  by  the  returned 
term. 


3.3  SIMPLE  REVISIONS 

Given a concept (for example the conflict predicate) and a set of
positive instances of the concept, translated to EF, the set of proof
trees of the instances involves a set of clauses. Consider a clause C
from this set, whose (Prolog) variables are the tuple
X1, ..., Xn, D1, ..., Dn, where the X’s and D’s are ordinal and nominal
respectively (where n > 0). Each instance of C is associated with an
n-tuple of ordinal variables $(x_1, \ldots, x_n) = \mathbf{x}$. We
should expect positive instances to have $\mathbf{x} \in \mathcal{R}$.
The region $\mathcal{R}$ is defined by logical expressions
$\mathcal{E}(\mathbf{x})$ involving ordinal variables $\mathbf{x}$ and
is not necessarily connected. In a similar manner a clause C which does
not succeed and which is involved in a failed proof tree (or trace) of
a negative instance will have $\mathbf{x} \notin \mathcal{R}$. In order
for instances to fail where they previously succeeded, and vice versa,
region $\mathcal{R}$ is revised to become region $\mathcal{R}'$ for
clause C. We classify revision operators that may change a clause
containing an ordinal literal into two kinds: simple and composite.
Simple operators involve deletion and addition of antecedents from a
clause, as in conventional TR, although the antecedents are restricted
to occurrences of order relations. This kind of operator is mainly for
finding and possibly correcting bugs in the model. For example, the
condition $x \succ y$ may be either removed or replaced by
$y \succ x$. The latter is akin (in 2-D geometrical terms) to examining
reflections of the region about a straight line.
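The two simple operators used in the example (dropping an order condition, or reversing its arguments) can be sketched as a candidate generator. This Python fragment is an illustration only: it assumes a clause body is a list of (left, relation, right) triples, it does not cover the addition of antecedents, and the function name and relation identifiers are hypothetical.

```python
from typing import List, Set, Tuple

Literal = Tuple[str, str, str]  # (left argument, relation, right argument)


def simple_revisions(body: List[Literal],
                     order_relations: Set[str]) -> List[List[Literal]]:
    """Generate candidate clause bodies by revising one order literal at a time.

    For every literal whose relation is an order relation, two candidates are
    produced: one with the literal dropped and one with its arguments swapped.
    Literals over other relations are never touched.
    """
    candidates = []
    for i, (left, relation, right) in enumerate(body):
        if relation not in order_relations:
            continue
        candidates.append(body[:i] + body[i + 1:])                 # drop
        candidates.append(body[:i] + [(right, relation, left)] + body[i + 1:])  # reverse
    return candidates
```

Each candidate body would then be scored against the training instances; restricting the generator to order literals is what keeps the number of candidates small.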

3.4  COMPOSITE  REVISIONS 

The second kind of revision is designed to clarify requirements with
complex conditions involving limiting values, which might not have been
captured initially from the expert sources, and also to cope with
changing requirements. We first deal with specialisation. Suppose C to
be a candidate for revision, or revision point, where C contains
antecedents of the form $x \succ a$, $a$ a constant. Further, suppose C
succeeds with instances $\theta_i C$ in proofs of some training
instances $e_i^{FP} \in FP$, with tuple $\mathbf{x}_i$ the ordinal
variables of $\theta_i C$. Suppose also that C is successful with
instances $\phi_j C$ in proof trees of training instances $e_j^{TP}$;
the tuples $\mathbf{y}_j$ are the ordinal variables of $\phi_j C$.
Failure of C would ensure the removal of some instances $e_i^{FP}$, and
in order that C should fail, we need to revise $\mathcal{R}$ to
$\mathcal{R}'$ so that

$\forall \mathbf{x}_i : \mathbf{x}_i \notin \mathcal{R}'$.

However, in order that C should safely succeed for correctly classified
instances, the tuples associated with $e_j^{TP}$ should not be removed
from $\mathcal{R}$. Thus

$\forall \mathbf{y}_j : \mathbf{y}_j \in \mathcal{R}'$.

We calculate the two sets of tuples:

$S^{FP} = \{\mathbf{x}_i \mid \mathbf{x}_i \text{ ordinal variables of } \theta_i C\}$,
$S^{TP} = \{\mathbf{y}_j \mid \mathbf{y}_j \text{ ordinal variables of } \phi_j C\}$    (1)

This allows for the fact that the mis-classification of some or all of
the $e^{FP}$ instances may have arisen from another clause. Recalling
that the variables of C are $\mathbf{x} = x_1 \ldots x_n$, we denote
the minimum and maximum values of variable component $x_i$ of the
$S^{FP}$ variables by $min_i^{FP}$, $max_i^{FP}$ respectively. In a
similar manner the minimum and maximum values of components of the
$S^{TP}$ variables are respectively $min_i^{TP}$, $max_i^{TP}$.

We induce the following: for instances $\theta_i C$ to fail, the new
specialised region is $\mathcal{R}$ less an $n$ dimensional interval
$\mathcal{R}_{FP}$ bounded by $min_i^{FP}$, $max_i^{FP}$. We have

$\mathcal{R}_{FP} = \{(x_1 \ldots x_n) \mid min_1^{FP} \preceq x_1 \preceq max_1^{FP} \wedge \ldots \wedge min_n^{FP} \preceq x_n \preceq max_n^{FP}\}$    (2)

However, for instances $\phi_j C$ to succeed, $\mathcal{R}'$ must
include an $n$ dimensional interval $\mathcal{R}_{TP}$ bounded by
$min_i^{TP}$, $max_i^{TP}$:

$\mathcal{R}_{TP} = \{(x_1 \ldots x_n) \mid min_1^{TP} \preceq x_1 \preceq max_1^{TP} \wedge \ldots \wedge min_n^{TP} \preceq x_n \preceq max_n^{TP}\}$    (3)

We have

$\mathcal{R}' = (\mathcal{R} \setminus \mathcal{R}_{FP}) \cup \mathcal{R}_{TP}$    (4)

($\mathcal{R}_{TP} = \mathcal{R}_{FP}$ is the limiting case, where all
of the misclassified instances have arisen from another clause.) In
order to accomplish the revision, we specialise the clause C as
follows: every occurrence in the unrevised body of C of the logical
expression $\mathcal{E}(x_1 \ldots x_n)$ should be replaced in the
revised body of C by $\mathcal{E}'(x_1 \ldots x_n)$, which is defined:

$(\mathcal{E}(x_1 \ldots x_n) \wedge \neg(min_1^{FP} \preceq x_1 \preceq max_1^{FP} \wedge \ldots \wedge min_n^{FP} \preceq x_n \preceq max_n^{FP}))$
$\vee\ (min_1^{TP} \preceq x_1 \preceq max_1^{TP} \wedge \ldots \wedge min_n^{TP} \preceq x_n \preceq max_n^{TP})$    (5)
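Equations (2) to (5) translate almost directly into code. The Python sketch below is a minimal illustration under stated assumptions: ordinal values are modelled as numbers, the original region is supplied as a membership function, and an empty set of tuples is taken to give an empty interval. All names (`bounding_box`, `in_box`, `specialise`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = List[Tuple[float, float]]  # one (min, max) pair per ordinal component


def bounding_box(points: Sequence[Point]) -> Box:
    """Component-wise minimum and maximum of a set of tuples, as in (2) and (3)."""
    return [(min(component), max(component)) for component in zip(*points)]


def in_box(x: Point, box: Box) -> bool:
    """True if every component of x lies within the interval for that component.

    An empty box (no tuples were collected) contains nothing.
    """
    return len(box) == len(x) and all(lo <= v <= hi for v, (lo, hi) in zip(x, box))


def specialise(region: Callable[[Point], bool],
               fp_points: Sequence[Point],
               tp_points: Sequence[Point]) -> Callable[[Point], bool]:
    """Return the membership test of R' = (R minus R_FP) union R_TP, as in (4) and (5)."""
    r_fp = bounding_box(fp_points)
    r_tp = bounding_box(tp_points)
    return lambda x: (region(x) and not in_box(x, r_fp)) or in_box(x, r_tp)


# Toy example with two flight levels: the original region needs the first above 290.
revised = specialise(lambda x: x[0] > 290,
                     fp_points=[(330, 330), (370, 350), (350, 370)],
                     tp_points=[])
```

In the toy example the point (350, 350) falls inside the interval spanned by the false positives and is removed, while (400, 350) lies outside it and is retained.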

Generalisation can be explained in a similar manner: in order for
instances $e^{FN}$ to succeed, their $\mathbf{x}$ components are added
to the region. However instances $e^{TN}$ must still fail. We calculate
sets $S^{FN}$, $S^{TN}$ and regions $\mathcal{R}_{FN}$,
$\mathcal{R}_{TN}$ in an analogous manner to $S^{FP}$, $S^{TP}$ in (1)
and $\mathcal{R}_{FP}$, $\mathcal{R}_{TP}$ in (2) and (3).

We induce the following: for some FN instances to succeed and all the
TN instances to fail, the new generalised region is

$\mathcal{R}' = (\mathcal{R} \cup \mathcal{R}_{FN}) \setminus \mathcal{R}_{TN}$    (6)

In order to accomplish this, we generalise the clause C, so that every
occurrence in the unrevised body of C of the logical expression
$\mathcal{E}(x_1 \ldots x_n)$ should be replaced by
$\mathcal{E}'(x_1 \ldots x_n)$, in an analogous manner to (5). In the
next section we explain how these simple and composite revisions were
applied to CPSef.
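The generalised region in (6) can be sketched in the same style. As before this is an illustration under assumptions rather than the project's implementation: ordinal values are numbers, the original region is a membership function, and every name is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]
Box = List[Tuple[float, float]]  # one (min, max) pair per ordinal component


def bounding_box(points: Sequence[Point]) -> Box:
    """Component-wise minimum and maximum of a set of tuples."""
    return [(min(component), max(component)) for component in zip(*points)]


def in_box(x: Point, box: Box) -> bool:
    """True if x lies inside the interval; an empty box contains nothing."""
    return len(box) == len(x) and all(lo <= v <= hi for v, (lo, hi) in zip(x, box))


def generalise(region: Callable[[Point], bool],
               fn_points: Sequence[Point],
               tn_points: Sequence[Point]) -> Callable[[Point], bool]:
    """Return the membership test of R' = (R union R_FN) minus R_TN, as in (6)."""
    r_fn = bounding_box(fn_points)
    r_tn = bounding_box(tn_points)
    return lambda x: (region(x) or in_box(x, r_fn)) and not in_box(x, r_tn)


# One-dimensional toy example: the original region is x > 10.
wider = generalise(lambda x: x[0] > 10, fn_points=[(5,), (8,)], tn_points=[(1,), (3,)])
```

Here the interval [5, 8] spanned by the false negatives is added to the region, while the interval [1, 3] spanned by the true negatives stays excluded. Note that the subtraction is applied last, so wherever the two intervals overlap the true negatives take precedence.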

4  EXPERIMENTS  WITH  TR 
TOOLS 

We  report  experiments  involving  two  kinds  of  data  set: 

1.  The first data-set consists of training instances from a day’s
    cleared flight profiles recorded in January 1995. This data was
    used with the object of testing our current techniques using
    ‘simple’ ordinal operators. When tested, the errors in the CPS as
    measured by this training set were 33 in 5070, having been
    previously reduced by other techniques. Use of TR with simple
    operators further reduced the errors to 1 in 5070.

2.  ‘Reduced separation for vertical minima’ (RVSM) criteria have
    recently been introduced for certain types of aircraft in North
    Atlantic airspace. A day’s cleared flight profiles were provided
    (from April 1997), where clearance is subject to the new revised
    criteria (of flight levels) for vertical separations for pairs of
    aircraft. The new criteria involved flight level intervals for both
    aircraft and were not captured by our current theory. ‘Simple’
    ordinal operators were not suitable for revisions of the type
    investigated, so this data-set was used for independently testing
    ‘composite’ ordinal operators. After the CPS was revised using
    simple operators, it was then re-revised using training instances
    from post-RVSM data and composite operators. All 121 errors
    resulting from the data cleared by the changed separation standard
    were eliminated by the method.

4.1  THE  METHOD  AND  RESULTS 

The method shown here is a general one for revision of a theory,
$\Gamma$, containing significant ordinal variables, although it is
based on experiments with the CPS. We have implemented TR in a manner
based on the ‘geometrical’ discussion above, and integrated it into our
legacy software environment (McCluskey et al. 1995). This architecture
has had to be both flexible and experimental in response to the
inherent complexity of formal specifications of the size and
expressiveness of the CPS. For example, we have had to develop a form
of blame assignment that can cope with proof trees from general clausal
form logic programmes. This involves first unfolding and then
transforming negative literals using De Morgan’s laws and is detailed
in (West et al. 1997).

4.1.1  An  Error  Removal  Experiment 

The algorithm for simple ordinal revisions is based on conventional
theory revision techniques and uses hill-climbing based on the accuracy
of the theory $\Gamma$. It is shown in Figure 2. The potential of a
clause C is the number of instances in which it succeeds in a proof
tree, and the negative potential is the number of instances in which it
fails in a proof trace; this notion was used in the description of the
FORTE tool (Richards and Mooney 1995).

1.  Collect training set of instances of a concept, L, known to contain
    misclassified instances.

2.  Classify training instances into TNs, FPs, FNs, TPs and calculate
    accuracy.

3.  Run blame assignment on instances in FN giving set of
    potential-pairs, P1.

    $P1 = \{(C, N) \mid C$ revisable clause in $\Gamma$ and $N$ is the potential of $C\}$

    Find subset OP1 of P1:

    $OP1 = \{(C, N) \mid (C, N) \in P1 \wedge C$ contains an ordinal relation$\}$.

4.  Repeat step 3 for instances in FP giving set of potential-pairs,
    P2, and subset OP2. Let $OP = OP1 \cup OP2$.

5.  Revision points $= \{C \mid (C, N) \in OP\}$. Apply each simple TR
    operator to each revision point, in order of C with largest
    potential. Implement the best revision.

6.  Repeat from step 2, unless a maximum accuracy has been reached.
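Steps 3 to 5 above amount to counting, filtering and sorting. The Python sketch below illustrates that bookkeeping under simplifying assumptions: blame assignment is taken as given and represented by the list of clause identifiers it returns for each misclassified instance, clauses are identified by integers, and the function and parameter names are hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def revision_points(blamed: Iterable[Iterable[int]],
                    revisable: Set[int],
                    has_ordinal: Callable[[int], bool]) -> List[int]:
    """Order candidate revision points by potential, largest first.

    blamed: for each misclassified instance, the clauses blamed for it.
    revisable: clauses that are not shielded from revision.
    has_ordinal: True if a clause contains an ordinal relation.
    """
    potential = Counter()
    for clauses in blamed:
        for clause in set(clauses):          # count each clause once per instance
            if clause in revisable:
                potential[clause] += 1
    ordinal_pairs = [(c, n) for c, n in potential.items() if has_ordinal(c)]
    ordinal_pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return [c for c, _ in ordinal_pairs]
```

Each simple operator would then be tried on the returned clauses in order, keeping whichever revision gives the best accuracy, as step 5 describes.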


Using a day’s worth of training instances (cleared flight profiles) we
obtained 33 FPs and 5037 TNs out of 5070 runs of the conflict axiom.
Because of the complexity of the criteria the revision was accomplished
by focusing the revision space on the longitudinal separation criteria
(i.e. concept L in Figure 2) rather than working from the initial
training instances. L was selected by studying the output of blame
assignment for all the FPs, and the generalised explanation output for
individual FPs. Longitudinal separation values in minutes can be 5, 6,
7, 8, 9, 10, 15, 20 or 30, and the CPS contains formalised criteria for
all of these. 75 new training instances were generated from proof trees
and proof traces in which a longitudinal separation value of 10 minutes
was assigned to two aircraft at least one of which is flying at
subsonic speed. The training instances included 25 FN and 50 TP, the
concept being:

    the_basic_min_longitudinal_sep_Val_in_mins_
    required_for(Segment1,Segment2) = 10.

The TPs were generated by re-running the day’s worth of instances, and
identifying those in vertical conflict that gave a longitudinal
separation of 10 minutes, but were not in overall conflict according to
both air traffic control officers and the CPSef (thus lowering the
possibility of noisy data). The FNs of concept L are derived directly
from the 33 false positives from the conflict predicate. The algorithm,
using simple reverse and dropping-condition operators, returned a new
theory with two clauses altered by both the operators; after revision,
74 of the training instances were covered, and only 1 (FN) uncovered.
Significantly, one of the clauses that was revised, defining the
predicate:

    are_after_a_common_pt_from_which_profile_tracks
    _are_same_or_diverging_thereafter_and_at_which
    _both_aircraft_have_already_reported_by

has been subsequently identified as an incorrect reading of an ATC
Manual.
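The counts reported in this experiment can be turned into accuracy figures with a one-line formula. The Python sketch below assumes accuracy means the percentage of correctly classified instances, (TP + TN) / total; the function name is hypothetical and the inputs are the counts quoted in the text.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Percentage of instances that the theory classifies correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no instances to score")
    return 100.0 * (tp + tn) / total


# Conflict axiom before revision: 33 FPs and 5037 TNs out of 5070 runs.
conflict_before = accuracy(tp=0, tn=5037, fp=33, fn=0)

# Concept L before revision: 25 FN and 50 TP.
concept_before = accuracy(tp=50, tn=0, fp=0, fn=25)

# Concept L after revision: 74 instances covered, 1 FN left uncovered.
concept_after = accuracy(tp=74, tn=0, fp=0, fn=1)
```

On these figures the conflict axiom was already correct on more than 99 per cent of runs, which is why the remaining errors had to be attacked through the lower-level concept L rather than through the top-level predicate.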

4.1.2  A  Requirements  Change  Experiment 

The method for implementing composite ordinal operators is shown in
Figure 3. Steps 1 and 2 are similar to those of the simple operator. If
after step 2, FN is larger than FP, then generalisation of a clause C
occurs in steps 4b .. 8b in a similar manner. Note that the driver for
the algorithm is the stability of the clauses in OP, rather than the
increase in accuracy of $\Gamma$.


Figure  2:  Algorithm  for  Simple  Ordinal  Operators 
1.  Collect training set of instances of a concept, known to contain
    misclassified instances. Initialise: D = deleted clauses = { },
    A = added clauses = { }.

2.  Classify training instances into TNs, FPs, FNs, TPs and calculate
    accuracy.

3a. Specialise $\Gamma$ in a manner indicated by steps 3a .. 8a. Run
    blame assignment on instances in FP giving set of potential-pairs,
    P.

    $P = \{(C, N) \mid C$ revisable clause in $\Gamma$ and $N$ is the potential of $C\}$.

    Find subset OP of P:

    $OP = \{(C, N) \mid (C, N) \in P \wedge C$ contains an ordinal relation$\}$.

4a. Select pair $(C, N)$ where $N$ is maximum of
    $\{N \mid (C, N) \in OP\}$.

5a. Calculate the $n$ dimensional regions $\mathcal{R}_{FP}$,
    $\mathcal{R}_{TP}$ defined by (1), (2) and (3).

6a. If $\mathcal{R}_{FP}$, $\mathcal{R}_{TP}$ are not equal
        set head of C' := head of C;
        set body of C' := body of C with $\mathcal{E}$ replaced by
        $\mathcal{E}'$ from (5)
    else
        delete C from OP and repeat from step 4a.

7a. Replace C with C' and calculate accuracy.

8a. $D' = D \cup \{C\}$;  $A' = A \cup \{C'\}$.

9.  Repeat from step 1 until OP is stable or accuracy is 100%.

Figure  3:  Algorithm  for  Composite  Ordinal  Operators. 

Because of the safety-critical nature of the application, and the fact that some data values may occur only rarely, it is necessary to check that a clause $C \in A$ originally arising from a function, remains defined over its intended domain. If this is the case, a post-processing phase is necessary.

204 training instances (classified according to post-RVSM criteria) of the conflict axiom were used to revise the CPS using the algorithm in Figure 3. However, revisions were confined to ordinals of the form ‘is_above’. When tested, there were found to be 121 FP instances and 83 TN instances, which agrees with the old accuracy of 83/204 = 40.686% reported in the result below. The blame assignment pinpointed the clause

’the_min_vertical_sep_Val_in_feet_required_for(A, B, C, D, 2000)’

as a revision point and the results are shown below. (The ‘limitvar’ predicate is a device for marking variable occurrences.) As can be seen, for supersonic aircraft, the criteria are unaltered. The criteria for a vertical separation of 2000 feet are specialised; they exclude the region where both flight levels are between FL 330 and FL 370 as shown in the following result:

lengths of FN, FP, TN, TP
0 121 83 0

%% set P.
[potential(1, 121), potential(2, 121), ..,
 potential(23, 1), potential(26, 121), ..]

%% list of revision points
[26]

New_accuracy = 100.0, Old_accuracy = 40.686

%% revised code for 2000
the_min_vertical_sep_Val_in_feet_required_for(A, B, C, D, 2000) :-
    ( both_are_flown_at_subsonic_speed(B, D),
      ( A is_above fl(290), limitvar(1),
        ( ( not(A is_at_or_above fl(330))
          ; not(A is_at_or_below fl(370)) )
        ; not(C is_at_or_above fl(330))
        ; not(C is_at_or_below fl(370)) )
      ; C is_above fl(290), limitvar(2),
        ( ( not(A is_at_or_above fl(330))
          ; not(A is_at_or_below fl(370)) )
        ; not(C is_at_or_above fl(330))
        ; not(C is_at_or_below fl(370)) ) )
    ; one_or_both_of_are_flown_at_supersonic_speed(B, D),
      ( A is_at_or_below fl(430), limitvar(3)
      ; C is_at_or_below fl(430), limitvar(4) ) ), !.
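The accuracy figures in the listing follow directly from the four counts. A small Python check, with a hypothetical helper name, makes the arithmetic explicit: accuracy is the share of correctly classified instances, (TN + TP) divided by the total.

```python
def accuracy_percent(fn, fp, tn, tp):
    """Percentage of correctly classified instances (TN + TP) over all instances."""
    total = fn + fp + tn + tp
    if total == 0:
        raise ValueError("no instances to score")
    return 100.0 * (tn + tp) / total


if __name__ == "__main__":
    # Counts in the order printed in the listing: FN, FP, TN, TP.
    old = accuracy_percent(fn=0, fp=121, tn=83, tp=0)
    print(round(old, 3))  # 83 correct out of 204 instances
```

With 0 FN, 121 FP, 83 TN and 0 TP, the 83 correct instances out of 204 give the 40.686 shown as the old accuracy.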

5  RELATED  WORK 

Some recent work has pointed to the similarities between the validation of requirements models and knowledge based systems development (McCluskey et al. 1996; Shaw and Gaines 1996), and hence the area of Knowledge Base Refinement (KBR) is related to our work. A detailed comparison of validation in software engineering and KBS is given in reference (Vermesan and Bench-Capon 1995), and the state of the art in automated KBS validation is surveyed in reference (Zlatareva and Preece 1994).

As far as we are aware, our work is the first to apply machine learning techniques to formal specifications of requirements, although, as mentioned above, the work most related to our own occurs in the field of KBR. Both areas have to adopt strategies to overcome the complexity pitfalls surrounding the use of TR (where theoretical results suggest that no polynomial algorithm exists to perform global optimisation in hill climbing algorithms (Greiner 1995)). In KRUST (Palmer and Craw 1996), for example, test cases are used one at a time to refine the KBS, in contrast to our focusing procedure, which uses multiple examples and a form of statistical blame assignment. In MOBAL, an environment for knowledge acquisition that has been used with a large security rule base, TR is also used but in a restrained fashion and with limited success (see (Sommer et al. 1994) page 453). Experience with MOBAL is consistent with our experience that ML tools work well in the context of a diverse tools environment.

Imperfect theory refinement techniques have been well researched in the machine learning literature, including reviews (Wrobel 1996), and a text relating ML to Software Engineering (Bergadano and Gunetti 1996). The case where theories represent planning domains is described in reference (Tae and Cook 1996) and the case where theories are posed as Horn Clause models is described in reference (Richards and Mooney 1995). Machine learning in domains containing significant numerical components has previously been accomplished by using neural networks (Opitz and Shavlik 1997). Constraint Inductive Logic Programming (Anthony and Frisch 1997; Sebag and Rouveirol 1996) has been utilised for generalisation and specialisation of numerical predicates. Theory Patching (Argamon-Engelson and Koppel 1998) is described as a type of TR in which revisions are made to individual components of the theory. (The concern of the latter paper is to determine for which classes of logical domain theories the theory patching problem is tractable.) Theory patching compares with our work on focusing on ordinal revisions and on shielding clauses which are not to be revised.

6  CONCLUSIONS  AND  FURTHER 
WORK 

In this paper we have reported the application of theory revision techniques to the validation and maintenance of a substantial ‘theory’, the formal requirements model of an air traffic control application. The model is encoded in msl, is customised by a generative grammar, animated by a Prolog generator, and can be analysed using an integrated environment supporting a diverse range of validation techniques (McCluskey 1997). After overcoming problems to do with blame assignment in general clause form programs (West et al. 1997), we developed the method whereby batches of tests were used by blame assignment, and single tests were used by explanation-based tools, to identify axiom sets in which bugs were likely to reside. After acquiring classified instances for these faulty components, we used theory revision operators, targeting comparison operators acting on ordinal sorts, to identify and remove the bugs. Here we have shown two different experiments where bugs were identified and removed, and a new part of the model was induced. The project started with an error rate for the conflict predicate of several hundred errors per 10,000 tests. The application of ML techniques in general has led us to establish the cause of all the errors shown up in our initial tests, and the error rates using code generated from the current version of our model have been cut by two orders of magnitude. Having said this, our success in fielding TR seems to depend on correctly predicting how fundamental the revisions are, and having the machinery available to bring about such a level of revision.

Many problems for future work remain, however. Most outstanding is the generalisation of our environment so that other customised msl models can be created and analysed using ML tools. Secondly, the TR algorithms for simple and composite revisions need to be further refined and perhaps merged. Also, the implications of using blame assignment which takes into account negative literals in proof trees need to be fully evaluated.
Abstract

Adaptive systems can learn to add an optimal amount of noise to some nonlinear feedback systems. Noise can improve the signal-to-noise ratio of many nonlinear dynamical systems. This “stochastic resonance” effect occurs in a wide range of physical and biological systems. The SR effect may also occur in engineering systems in signal processing, communications, and control. The noise energy can enhance the faint periodic signals or faint broadband signals that force the dynamical systems. Most SR studies assume full knowledge of a system’s dynamics and its noise and signal structure. Fuzzy and other adaptive systems can learn to induce SR based only on samples from the process. These samples can tune a fuzzy system’s if-then rules so that the fuzzy system approximates the dynamical system and its noise response. The paper derives the SR optimality conditions that any stochastic learning system should try to achieve. The adaptive system learns the SR effect as the system performs a stochastic gradient ascent on the signal-to-noise ratio. The stochastic learning scheme does not depend on a fuzzy system or any other adaptive system. The learning process is slow and noisy and can require heavy computation. Robust noise suppressors can improve the learning process when we can estimate the impulsiveness of the noise or of other learning terms. Simulations test this SR learning scheme on the popular quartic-bistable dynamical system and on other dynamical systems for many types of noise. Simulations suggest that fuzzy techniques and perhaps other “intelligent” techniques can induce SR in many cases when users cannot state the exact form of the dynamical systems.


1  STOCHASTIC  RESONANCE 

Noise can sometimes enhance a signal as well as corrupt it. This fact may seem at odds with almost a century of effort in signal processing to filter noise or to mask or cancel it. But noise is itself a signal and a free source of energy. Noise can amplify a faint signal in some feedback nonlinear systems even though too much noise can swamp the signal. This implies that a system’s optimal noise level need not be zero noise. It also suggests that nonlinear signal systems with nonzero-noise optima may be the rule rather than the exception.

Stochastic resonance (SR) [2, 3, 16] occurs when noise enhances an external forcing signal in a nonlinear dynamical system. SR occurs in a signal system if and only if the system has a nonzero noise optimum. The classic SR signature is a signal-to-noise ratio (SNR) that is not monotone. Figure 1 shows the SR effect for the popular quartic bistable dynamical system [2, 3]. The SNR rises to a maximum and then falls as the variance of the additive white noise grows. More complex systems may have multimodal SNRs.

SR  holds  promise  for  the  design  of  engineering  systems 
in  a  wide  range  of  applications.  Engineers  may  want  to 
shape  the  noise  background  of  a  fixed  signal  pattern  to 
exploit  the  SR  effect.  Or  they  may  want  to  adapt  their 
signals  to  exploit  a  fixed  noise  background.  Engineers 
now  add  noise  to  some  systems  to  improve  how  humans 
perceive  signals  [12,  14].  Some  control  schemes  add  a 
noise-like  dither  to  improve  system  performance  [18]. 

The study of SR has emerged largely from physics and biology. The awkward term “stochastic resonance” stems from a 1981 article in which physicists observed “the cooperative effect between internal mechanism and the external periodic forcing” in some nonlinear dynamical systems [2]. Scientists soon explored SR in climate models [17] to explain how noise could induce periodic ice ages [1]. They conjectured that global or other noise sources could amplify small periodic variations in the Earth’s orbit. This might explain the observed 100,000 year primary cycle of the Earth’s ice ages. Physicists have since found stronger evidence of SR in various systems [11, 16, 19].

Figure 1: The non-monotonic signature of stochastic resonance. The graph shows the smoothed output signal-to-noise ratio of the noisy signal-forced quartic bistable system $\dot{x} = f(x) + s(t) + n(t) = x - x^3 + s(t) + n(t)$. The vertical dashed lines show the absolute deviation between the smallest and largest outliers in each sample average of 20 outcomes. The system has a nonzero noise optimum and thus shows the SR effect. The Gaussian noise $n(t)$ adds to the external forcing narrowband signal $s(t) = \varepsilon \sin \omega_0 t$. Other systems can use multiplicative noise or use non-Gaussian noise [4].

Below we explore how to learn the SR effect with adaptive systems in general and with adaptive fuzzy function approximators [9] in particular. Neural-like learning laws tune and move the fuzzy rule patches as they tune the shape of the fuzzy sets that make up the rule patches. The learning laws use input-output data from the sampled noisy dynamical system. The rule patches move quickly to cover optimal or near-optimal regions of the function (such as its extrema). Fuzzy systems achieve their patch-covering approximation at the high cost of rule explosion [9]. The number of rules grows exponentially with the state-space dimension of the fuzzy system. We stress that our SR learning laws can also tune non-fuzzy adaptive systems. Our first goal was to show that adaptive systems can learn to shape the input noise and perhaps shape other terms to achieve SR in the main closed-form dynamical systems that scientists have shown produce the SR effect. Our second goal was to suggest through these simulation experiments that adaptive fuzzy systems or other model-free approximators might achieve SR in the more complex dynamical systems that defy easy math modeling or measurement.

This paper presents three main results. The first and central result is that a system can learn the SR effect if it performs a stochastic gradient ascent on the signal-to-noise ratio SNR = S/N. Then the random noise gradient $\partial \mathrm{SNR}/\partial \sigma$ can tune the parameters in any adaptive system through a slow type of stochastic approximation. The second result is that the SNR first-order condition for an extremum has the ratio form $S/N = S'/N'$ for $S' = \partial S/\partial \sigma$. The term $S'/N'$ can produce impulsive or even Cauchy noise that can destabilize the stochastic gradient ascent. Time lags in the training process can compound this impulsiveness. The third result is that a Cauchy-based noise suppressor from the theory of robust statistics can often reduce the impulsiveness of the noise gradient $\partial \mathrm{SNR}/\partial \sigma$ and thus improve the learning process.

2  ADDITIVE  FUZZY  SYSTEMS  & 
FUNCTION  APPROXIMATION 

A fuzzy system $F : R^n \to R^p$ stores $m$ rules of the word form “If $X = A_j$ Then $Y = B_j$” or the patch form $A_j \times B_j \subset X \times Y = R^n \times R^p$. The if-part fuzzy sets $A_j \subset R^n$ and then-part fuzzy sets $B_j \subset R^p$ have set functions $a_j : R^n \to [0,1]$ and $b_j : R^p \to [0,1]$.

Generalized fuzzy sets map to intervals other than [0,1]. The scalar sinc set functions in Figure 6 map real inputs to “membership degrees” in the bipolar range [-0.217, 1]. The system design must take care when these negative set values enter the SAM ratio in (2). The system can use the joint set function $a_j$ or some factored form such as $a_j(x) = a_j^1(x_1) \cdots a_j^n(x_n)$ or $a_j(x) = \min(a_j^1(x_1), \ldots, a_j^n(x_n))$ or any other conjunctive form for input vector $x = (x_1, \ldots, x_n) \in R^n$ [9]. An additive fuzzy system [9] sums the “fired” then-part sets $B'_j$:

$$B(x) = \sum_{j=1}^{m} w_j B'_j = \sum_{j=1}^{m} w_j a_j(x) B_j. \tag{1}$$

Figure 2a shows the parallel fire-and-sum structure of the standard additive model (SAM). These nonlinear systems can uniformly approximate any continuous (or bounded measurable) function $f$ on a compact domain [9]. Engineers often apply fuzzy systems to problems of control but fuzzy systems can also apply to problems of communication and signal processing [9] and other fields.

Figure 2b shows how three rule patches can cover part of the graph of a scalar function $f : R \to R$. The patch-cover structure implies that fuzzy systems $F : R^n \to R^p$ suffer from rule explosion in high dimensions. A fuzzy system $F$ needs on the order of
rules  to  cover  the  graph  and  thus  to  approximate  a 


Stochastic  Resonance  with  Adaptive  Fuzzy  Systems  379 


Figure 2: Feedforward fuzzy function approximator. (a) The parallel associative structure of the additive fuzzy system $F : R^n \to R^p$ with $m$ rules. Each input $x_0 \in R^n$ enters the system $F$ as a numerical vector. At the set level $x_0$ acts as a delta pulse $\delta(x - x_0)$ that combs the if-part fuzzy sets $A_j$ and gives the $m$ set values $a_j(x_0) = \int \delta(x - x_0)\, a_j(x)\, dx$. The set values “fire” or scale the then-part fuzzy sets $B_j$ to give $B'_j$. A standard additive model (SAM) scales each $B_j$ with $a_j(x)$. Then the system sums the $B'_j$ sets to give the output “set” $B$. The system output $F(x_0)$ is the centroid of $B$. (b) Fuzzy rules define Cartesian rule patches $A_j \times B_j$ in the input-output space and cover the graph of the approximand $f$.


vector function $f : R^n \to R^p$. Optimal rules can help deal with the exponential rule explosion. Lone or local mean-squared optimal rule patches cover the extrema of the approximand $f$ [9]. They “patch the bumps.” Better learning schemes move rule patches to or near extrema and then fill in between extrema with extra rule patches if the rule budget allows.

The scaling choice $B'_j = a_j(x) B_j$ gives a standard additive model or SAM. Taking the centroid of $B(x)$ in (1) gives the following SAM ratio [9]

$$F(x) = \frac{\sum_{j=1}^{m} w_j a_j(x) V_j c_j}{\sum_{j=1}^{m} w_j a_j(x) V_j} = \sum_{j=1}^{m} p_j(x) c_j. \tag{2}$$

The if-part fuzzy sets $A_j \subset R^n$ have set functions $a_j : R^n \to [0,1]$. The then-part sets $B_j \subset R^p$ have finite positive volume or area $V_j$ and centroid or center of mass $c_j$. The convex weights $p_1(x), \ldots, p_m(x)$ have the form $p_j(x) = \frac{w_j a_j(x) V_j}{\sum_{i=1}^{m} w_i a_i(x) V_i}$. The convex coefficients $p_j(x)$ change with each input vector $x$. We can ignore the rule weights $w_j$ if we put $w_1 = \cdots = w_m > 0$.
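The SAM ratio in (2) is a weighted average of the then-part centroids. A minimal Python sketch for the scalar-output case, with hypothetical function and argument names, computes both the convex coefficients $p_j(x)$ and the output $F(x)$ from set values that are assumed to have been evaluated already.

```python
def sam_output(a, w, V, c):
    """Scalar SAM ratio: F(x) = sum_j w_j a_j V_j c_j / sum_j w_j a_j V_j.

    a: set values a_j(x) already evaluated at the input x
    w: rule weights w_j
    V: then-part volumes (or areas) V_j
    c: then-part centroids c_j (scalars in this sketch)
    Returns (F, p), where p holds the coefficients p_j(x).
    """
    if not (len(a) == len(w) == len(V) == len(c)):
        raise ValueError("all rule parameter lists must have the same length")
    terms = [wj * aj * vj for wj, aj, vj in zip(w, a, V)]
    denom = sum(terms)
    if denom == 0.0:
        raise ValueError("no rule fires for this input")
    p = [t / denom for t in terms]
    F = sum(pj * cj for pj, cj in zip(p, c))
    return F, p


if __name__ == "__main__":
    # Three rules; the third does not fire for this input.
    F, p = sam_output(a=[0.2, 0.8, 0.0], w=[1.0, 1.0, 1.0],
                      V=[1.0, 1.0, 2.0], c=[-1.0, 0.0, 3.0])
    print(F, p)
```

The coefficients sum to one by construction. They are nonnegative, and so truly convex, only when every set value is nonnegative, which is why bipolar set functions need care in the ratio.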


3  SR  LEARNING  AND 
EQUILIBRIUM 

The scalar standard additive model (SAM) [9] fuzzy system $F : R^n \to R$ can learn the SR pattern of optimum noise of an unknown dynamical system if it uses enough rules and if it samples enough data from a dynamical system that stochastically resonates. Below we derive a gradient-based learning law that tunes the SAM parameters to achieve SR from samples of system dynamics. It can also tune the parameters in other adaptive systems. We first define a practical SNR measure in terms of discrete Fourier transforms. Other SR measures can give other learning laws.


3.1  THE  SNR  IN  NONLINEAR  SYSTEMS 


Suppose a nonlinear dynamical system has a sinewave forcing function $s(t)$ of known frequency $f_0$ Hz. We search the system output response $y(t)$ for its sinusoidal part $r(t)$, which has the known frequency $f_0$ but unknown amplitude and phase. The “noisy signal” $y(t)$ has the form of “signal” plus “noise”:

$$y_t = r_t + n_t. \tag{3}$$

The signal-to-noise ratio (SNR) at the output is the spectral ratio of the energy of $\{r_t\}$ to the energy of $\{n_t\}$. We assume that the signal $s(t)$ is always present. This ignores the important problem of signal detection but lets us focus on learning the SR effect.

We define the SNR measure as

$$\mathrm{SNR} = \frac{S}{N} = \frac{S}{P - S}. \tag{4}$$

Here $S = 2|Y[k_0]|^2$, $P = \sum_{k=0}^{L-1} |Y[k]|^2$, and $Y[k]$ is the $L$-point discrete Fourier transform (DFT) of $y_t$:

$$Y[k] = \sum_{t=0}^{L-1} y_t\, e^{-i 2\pi k t / L}. \tag{5}$$

We assume that the discrete frequency $k_0 = f_0 L T_s > 0$ is an integer for sampling rate $1/T_s$ and $\omega_0 = 2\pi f_0$. We also assume that there is no aliasing due to sampling.
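The SNR measure in (4) needs only one DFT bin plus the total spectral power. The Python sketch below assumes the usual unnormalized DFT with a negative exponent and uses Parseval's identity, $\sum_k |Y[k]|^2 = L \sum_t y_t^2$, to get $P$ without computing every bin. The function names are illustrative.

```python
import cmath
import math


def dft_bin(y, k):
    """Single bin Y[k] of the unnormalized L-point DFT of the real sequence y."""
    L = len(y)
    return sum(y[t] * cmath.exp(-2j * math.pi * k * t / L) for t in range(L))


def snr_measure(y, k0):
    """Return (SNR, S, N) with S = 2|Y[k0]|^2, P = sum_k |Y[k]|^2, N = P - S."""
    L = len(y)
    S = 2.0 * abs(dft_bin(y, k0)) ** 2
    # Parseval for the unnormalized DFT: sum_k |Y[k]|^2 = L * sum_t y_t^2.
    P = L * sum(v * v for v in y)
    N = P - S
    if N <= 0.0:
        raise ValueError("no noise power outside bin k0")
    return S / N, S, N


if __name__ == "__main__":
    L, k0 = 256, 8
    # Unit-amplitude "signal" at bin 8 plus a half-amplitude tone at bin 20
    # standing in for the noise; the power ratio is 1 / 0.25.
    y = [math.sin(2 * math.pi * k0 * t / L) + 0.5 * math.sin(2 * math.pi * 20 * t / L)
         for t in range(L)]
    snr, S, N = snr_measure(y, k0)
    print(snr, 10 * math.log10(snr))
```

The factor 2 in $S$ accounts for the mirror bin at $L - k_0$ of a real sequence, so the sketch requires $0 < k_0 < L/2$.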
Then we can show that for large $L$ the SNR measure in (4) tends to the standard definition of SNR as a ratio of variances:

Theorem:
$$\mathrm{SNR} = \frac{2|Y[k_0]|^2}{\sum_{k=0}^{L-1} |Y[k]|^2 - 2|Y[k_0]|^2} \to \frac{\sigma_r^2}{\sigma_n^2} \quad \text{as } L \to \infty.$$

Here $\sigma_n^2 = \mathrm{Var}(n) = E[n^2] < \infty$ and $\sigma_r^2 = \frac{1}{T}\int_0^T (A \sin \omega_0 t)^2\, dt = A^2/2$.


3.2  SR  LEARNING  AND  OPTIMALITY 

An adaptive system can learn an SR noise pattern that maximizes a dynamical system’s SNR. The learning law updates a parameter $m_j$ of a SAM fuzzy system (or of any other adaptive system) at time step $n$ with the deterministic law

$$m_j(n+1) = m_j(n) + \mu_n \frac{\partial E[\mathrm{SNR}]}{\partial m_j} \tag{6}$$

for learning coefficients $\{\mu_n\}$. This is gradient ascent learning. We assume that the first-order moment of the SNR exists and is finite. We seldom know the probability structure or the expectation of the SNR. So we estimate this expectation with its random realization at each time step: $E[\mathrm{SNR}] \approx \mathrm{SNR}$. This gives the stochastic gradient learning law

$$m_j(n+1) = m_j(n) + \mu_n \frac{\partial\, \mathrm{SNR}}{\partial m_j} \tag{7}$$

or simple random hill climbing. We assume the chain rule holds (at least approximately) to give

$$\frac{\partial\, \mathrm{SNR}}{\partial m_j} = \frac{\partial\, \mathrm{SNR}}{\partial \sigma} \frac{\partial \sigma}{\partial m_j} = \frac{\partial\, \mathrm{SNR}}{\partial \sigma} \frac{\partial F}{\partial m_j}. \tag{8}$$

Here $\sigma$ is the noise level or standard deviation of the forcing noise term $n(t)$. We want the SAM or other adaptive system $F$ to approximate the optimal noise level $\sigma_{opt}$ for any input signal or initial condition of the dynamical system: $F \approx \sigma_{opt}$. We then use $\sigma$ and $F$ interchangeably in (8). The term $\partial F / \partial m_j$ shows how any adaptive system $F$ depends on its $j$th parameter $m_j$. We again assume that the chain rule holds to get


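$$\frac{\partial\, \mathrm{SNR}}{\partial \sigma} = \frac{\partial\, \mathrm{SNR}}{\partial S} \frac{\partial S}{\partial \sigma} + \frac{\partial\, \mathrm{SNR}}{\partial N} \frac{\partial N}{\partial \sigma}. \tag{9}$$

Then SNR = 10 log S/N implies that

$$\frac{\partial\, \mathrm{SNR}}{\partial S} = \frac{\partial}{\partial S}\, 10 \log \frac{S}{N} = (10 \log e) \frac{1}{S} \tag{10}$$

$$\frac{\partial\, \mathrm{SNR}}{\partial N} = \frac{\partial}{\partial N}\, 10 \log \frac{S}{N} = -(10 \log e) \frac{1}{N} \tag{11}$$

for base-10 logarithm. We next put (10)-(11) into (9) to get the log term that drives SR learning:

$$\frac{\partial\, \mathrm{SNR}}{\partial \sigma} = (10 \log e) \left( \frac{1}{S} \frac{\partial S}{\partial \sigma} - \frac{1}{N} \frac{\partial N}{\partial \sigma} \right). \tag{12}$$

The right side of (12) leads to the first-order condition for an SNR extremum:

$$\frac{S}{N} = \frac{\partial S / \partial \sigma}{\partial N / \partial \sigma} \tag{13}$$

when the partial derivatives of $S$ and $N$ with respect to $\sigma$ are not zero at $\sigma = \sigma_{opt}$. Equation (13) gives a necessary condition for the SR maximum. The result (13) says that at SR the ratio of the rates of change of $S$ and $N$ must equal the ratio of $S$ and $N$. But (13) holds only in a stochastic sense for sufficiently well-behaved random processes. The second-order condition for an SR maximum is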

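$$0 > \frac{\partial^2\, \mathrm{SNR}}{\partial \sigma^2} = \frac{\partial}{\partial \sigma} \left[ (10 \log e) \left( \frac{1}{S} \frac{\partial S}{\partial \sigma} - \frac{1}{N} \frac{\partial N}{\partial \sigma} \right) \right] \tag{14}$$

$$= (10 \log e) \left[ \frac{1}{S} \frac{\partial^2 S}{\partial \sigma^2} - \frac{1}{S^2} \left( \frac{\partial S}{\partial \sigma} \right)^2 - \frac{1}{N} \frac{\partial^2 N}{\partial \sigma^2} + \frac{1}{N^2} \left( \frac{\partial N}{\partial \sigma} \right)^2 \right] \tag{15}$$

$$= (10 \log e) \left[ \frac{1}{S} \frac{\partial^2 S}{\partial \sigma^2} - \frac{1}{N} \frac{\partial^2 N}{\partial \sigma^2} \right] \tag{16}$$

or $\frac{1}{S} \frac{\partial^2 S}{\partial \sigma^2} < \frac{1}{N} \frac{\partial^2 N}{\partial \sigma^2}$. The last equality follows from the first-order condition $\frac{1}{S} \frac{\partial S}{\partial \sigma} - \frac{1}{N} \frac{\partial N}{\partial \sigma} = 0$ or $\frac{1}{S} \frac{\partial S}{\partial \sigma} = \frac{1}{N} \frac{\partial N}{\partial \sigma}$ since then $\frac{1}{S^2} \left( \frac{\partial S}{\partial \sigma} \right)^2 = \frac{1}{N^2} \left( \frac{\partial N}{\partial \sigma} \right)^2$. A like result holds for SNR = S/N. These first- and second-order conditions show how the signal power $S$ and noise power $N$ relate to each other and to their derivatives at the SR maximum.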

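We now derive the SR learning laws in terms of DFTs. We can approximate $\partial S / \partial \sigma$ and $\partial N / \partial \sigma$ with a ratio of time differences at each iteration $n$:

$$\frac{\partial S_n}{\partial \sigma_n} \approx \frac{\Delta S_n}{\Delta \sigma_n} = \frac{S_n - S_{n-1}}{\sigma_n - \sigma_{n-1}} \tag{17}$$

$$\frac{\partial N_n}{\partial \sigma_n} \approx \frac{\Delta N_n}{\Delta \sigma_n} = \frac{N_n - N_{n-1}}{\sigma_n - \sigma_{n-1}}. \tag{18}$$

Then put (17) and (18) into (9) to get the stochastic gradient learning law, with the constant factor $10 \log e$ absorbed into the learning coefficient $\mu_n$:

$$m_j(n+1) = m_j(n) + \mu_n \frac{\partial\, \mathrm{SNR}_n}{\partial m_j} \tag{19}$$

$$= m_j(n) + \mu_n \frac{\partial\, \mathrm{SNR}_n}{\partial \sigma_n} \frac{\partial F}{\partial m_j} \tag{20}$$

$$\approx m_j(n) + \mu_n \left( \frac{1}{S_n} \frac{\partial S_n}{\partial \sigma_n} - \frac{1}{N_n} \frac{\partial N_n}{\partial \sigma_n} \right) \frac{\partial F}{\partial m_j}. \tag{21}$$

Below we derive the last partial derivative in the chain-rule expansion (8) for all SAM fuzzy parameters.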
Figure 3: Learning paths of $\sigma_n$ for the quartic bistable system (30)-(31). The input sinusoid signal function is $s(t) = 0.1 \sin 2\pi(0.01)t$. The optimum noise intensity lies near $\sigma = 0.5$ from the SNR-noise profile in Figure 1. (a) Impulsive effects on learning paths of noise level $\sigma_n$ with different initial values. The paths of $\sigma_n$ do not converge to the optimum noise. This stems from the impulsiveness of the derivative term $\partial\, \mathrm{SNR}_n / \partial \sigma_n$ in the SR learning law. (b) Learning paths of $\sigma_n$ with the Cauchy noise suppressor $\phi$. The term $\phi(\partial\, \mathrm{SNR}_n / \partial \sigma_n)$ replaces $\partial\, \mathrm{SNR}_n / \partial \sigma_n$ in the SR learning law as in (36). The paths of $\sigma_n$ wander in a Brownian-like motion around the optimum noise. The suppressor function $\phi$ makes the learning algorithm more robust against impulsive shocks.


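This is again the step where users can insert other adaptive function approximators $F$ and derive learning laws for their parameters $m_j$ by expanding $\partial F / \partial m_j$. The chain rule gives the partial derivatives

$$\frac{\partial F}{\partial c_j} = \frac{w_j a_j(x) V_j}{\sum_{i=1}^{m} w_i a_i(x) V_i} = p_j(x)$$

$$\frac{\partial F}{\partial V_j} = \frac{p_j(x)}{V_j} [c_j - F(x)] \tag{23}$$

$$\frac{\partial F}{\partial m_j} = \frac{\partial F}{\partial a_j} \frac{\partial a_j}{\partial m_j} \quad \text{and} \quad \frac{\partial F}{\partial d_j} = \frac{\partial F}{\partial a_j} \frac{\partial a_j}{\partial d_j}$$

where

$$\frac{\partial F}{\partial a_j} = [c_j - F(x)] \frac{p_j(x)}{a_j(x)}.$$

We used the sinc set functions [9, 13] in our simulation. The sinc set function has the form $a_j(x) = \sin\left(\frac{x - m_j}{d_j}\right) \Big/ \left(\frac{x - m_j}{d_j}\right)$. Its partial derivative with respect to $m_j$ is

$$\frac{\partial a_j}{\partial m_j} = \begin{cases} \left( a_j(x) - \cos\left(\frac{x - m_j}{d_j}\right) \right) \frac{1}{x - m_j} & \text{for } x \neq m_j \\ 0 & \text{for } x = m_j. \end{cases}$$

We used small but constant learning rates in most simulations.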
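The set-function derivative can be checked numerically. The sketch below assumes the sinc form $a_j(x) = \sin(u)/u$ with $u = (x - m_j)/d_j$, where $m_j$ is the center and $d_j$ a width parameter. The function names are illustrative, and the finite-difference comparison in the example is only a sanity check, not part of the learning law.

```python
import math


def sinc_set(x, m, d):
    """Sinc set function a(x) = sin(u)/u with u = (x - m)/d, and a(m) = 1."""
    u = (x - m) / d
    return 1.0 if u == 0.0 else math.sin(u) / u


def dsinc_dm(x, m, d):
    """Partial derivative of the sinc set function with respect to its center m."""
    if x == m:
        return 0.0
    return (sinc_set(x, m, d) - math.cos((x - m) / d)) / (x - m)


if __name__ == "__main__":
    x, m, d, h = 1.3, 0.4, 0.7, 1e-6
    # Central finite difference in m as an independent check of the formula.
    numeric = (sinc_set(x, m + h, d) - sinc_set(x, m - h, d)) / (2.0 * h)
    print(dsinc_dm(x, m, d), numeric)
    # The global minimum of sin(u)/u lies near u = 4.4934.
    print(sinc_set(4.4934, 0.0, 1.0))
```

The second printed value shows where the lower end of the bipolar range comes from: the sinc function dips to roughly -0.217 before its side lobes decay.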


4 SIMULATION RESULTS

This section shows how the stochastic SR learning laws in Section 3 tend to find the optimal noise levels. The learning process updates the noise parameter $\sigma_n$ at each sample time $n$. The learning process is noisy and may not be stable due to the impulsiveness of the random gradient $\partial\, \mathrm{SNR}_n / \partial \sigma_n$. We used a Cauchy noise suppressor from the theory of robust statistics [8] to stabilize the learning process. Then sample paths of $\sigma_n$ converged and wandered about the optimal values if the initial values were close to the optimum.

The response of a system depends on its dynamics and on the nature of its input signals. We applied the SNR measure to the quartic bistable system with sinusoidal inputs. Future research may extend SR learning to wideband input signals. Figure 7a shows how the optimum noise level varies for each input sinewave in the quartic bistable system. The learning process samples the system’s input-output response as it learns the optimum noise. It does not make direct use of the equation that underlies the system. It needs access only to the system’s input-output responses. Then an adaptive fuzzy system encodes this pattern of optimum noise in its if-then rules when gradient learning tunes its parameters. The fuzzy system learns this optimum noise level as it varies the output of a random noise generator. More complex fuzzy systems can themselves act as adaptive random number generators [9].

Figure 4: Learning paths of $\sigma_n$ with the suppressor $\phi$ for other noise densities in the quartic bistable system (30)-(31) with input signal $s(t) = 0.1 \sin 2\pi(0.01)t$. The noise $n$ has densities (a) Laplace noise and (b) uniform noise. The SNR-noise profiles show that optimal noise levels lie near $\sigma = 0.5$ for both cases.

4.1 SR IN THE QUARTIC BISTABLE SYSTEM

We tested the quartic bistable system $\dot{x} = ax - bx^3 + s(t) + n(t)$ because of its wide use in the SR literature as a benchmark SR dynamical system. The constants $a = b = 1$ and the binary output give the system [16]

$$\dot{x} = x - x^3 + s(t) + n(t) \tag{28}$$

$$y(t) = \mathrm{sgn}(x(t)) \tag{29}$$

where $s(t) = \varepsilon \sin \omega_0 t$ is a sinewave input forcing term and $n$ is a zero-mean additive white noise with variance $D = \sigma^2$. The simulation uses the discrete version:

$$x_{t+1} = x_t + T(x_t - x_t^3 + \varepsilon \sin 2\pi f_0 T t) + \sqrt{T}\, n_t \tag{30}$$

$$y_t = \mathrm{sgn}(x_t) \tag{31}$$

with initial condition $x_0$ and time step $T$. The zero-mean white noise sequence $\{n_t\}$ has variance $D_t = \sigma^2(t)$. The term $\sqrt{T}$ scales $n_t$ so that it conforms with the Wiener increment [6]. The simulations use Gaussian noise, Laplace noise, and uniform noise.
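The discrete system (30)-(31) is a simple Euler-type recursion. The following Python sketch iterates it with Gaussian noise only. The function name, the default parameter values, and the convention sgn(0) = +1 are assumptions made for illustration and are not taken from the simulations reported here.

```python
import math
import random


def simulate_quartic_bistable(sigma, eps=0.1, f0=0.01, T=0.0195,
                              steps=1000, x0=0.0, seed=0):
    """Iterate x_{t+1} = x_t + T (x_t - x_t^3 + eps sin(2 pi f0 T t)) + sqrt(T) n_t.

    sigma: standard deviation of the zero-mean Gaussian noise n_t
    Returns the state path xs and the binary output path ys = sgn(xs).
    """
    rng = random.Random(seed)      # seeded so that runs are reproducible
    x = x0
    xs, ys = [], []
    for t in range(steps):
        n_t = rng.gauss(0.0, sigma)
        drift = x - x ** 3 + eps * math.sin(2.0 * math.pi * f0 * T * t)
        x = x + T * drift + math.sqrt(T) * n_t
        xs.append(x)
        ys.append(1.0 if x >= 0.0 else -1.0)   # assumption: sgn(0) = +1
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate_quartic_bistable(sigma=0.5, steps=2000)
    print(xs[-1], sum(ys) / len(ys))
```

With no noise and no forcing the state settles into one of the two wells at $x = \pm 1$. A weak sinewave alone cannot push it across the barrier, which is what leaves room for noise to help.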


We look at the equilibrium term or the random optimality “error” process

$$\mathcal{E} = \frac{S}{N} - \frac{\partial S / \partial \sigma}{\partial N / \partial \sigma} \tag{32}$$

near the optimum noise $\sigma = \sigma_{opt}$. The probability density of $\mathcal{E}$ depends on the statistics of the input noise, the differential equation that defines the dynamical system, and how we define the signal and noise terms $S$ and $N$. The empirical test of $\mathcal{E}_n$ found that $\mathcal{E}_n$ had infinite variance in our simulations. The log-tail test of the parameter $\alpha$ in the family of alpha-stable probability densities leads to the estimate $\alpha \approx 1.0$. So the $\mathcal{E}_n$ density is approximately Cauchy. Recall also that $Z = X/Y$ is a Cauchy random variable if $X$ and $Y$ are Gaussian or if they obey certain more general statistical conditions [10]. This suggests that much of the impulsive nature of $\mathcal{E}_n$ and hence of the learning process may stem from the ratio of derivatives in (32).

Figure 5: SR learning paths of $\sigma_n$ for other dynamical systems. (a) The forced bistable neuron model $\dot{x} = -x + 2 \tanh x + \varepsilon \sin(\omega_0 t) + n(t)$ with binary output $y(t) = \mathrm{sgn}(x(t))$. The parameters of the input sinewave are $\omega_0 = 2\pi f_0$ with $f_0 = 0.01$ Hz and $\varepsilon = 0.3$. (b) The FitzHugh-Nagumo neuron model $\epsilon \dot{x} = -x(x^2 - \frac{1}{4}) - w + A + s(t) + n(t)$ and $\dot{w} = x - w$ with output $y(t) = x(t)$. The parameters are $\epsilon = 0.005$ and $A = -(5/12\sqrt{3} + 0.07)$. The sinewave input signal is $s(t) = \varepsilon \sin 2\pi f_0 t$ where $\varepsilon = 0.01$ and $f_0 = 0.5$ Hz.

We sample $S_n$ and $N_n$ after a long period of time in (17) and (18). This approximation lets us choose the time length between step $n$ and step $n + 1$. Longer time lengths can better show how the noise intensity $\sigma_n$ affects $S_n$, $N_n$, and the $\mathrm{SNR}_n$. We chose the time length $T_{n+1} - T_n = 2000$ seconds for the simulations. The learning process’s sampling interval $T_s$ differs from the time step $T$ of the dynamical system’s simulator in (30)-(31). The time step is $T = 0.0195$. The sampling period is $T_s = 0.976$ seconds. This yields 2048 samples per iteration. This long period of time allows for low frequency signals such as $f_0 = 0.001$ Hz. We ignored all aliasing effects. We also replace the difference $\sigma_n - \sigma_{n-1}$ with $\mathrm{sgn}(\sigma_n - \sigma_{n-1})$ to avoid numerical instability. The gradient becomes


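\[ \frac{\partial \mathrm{SNR}_n}{\partial \sigma_n} \approx \left( \frac{\Delta S_n}{S_n} - \frac{\Delta N_n}{N_n} \right) \mathrm{sgn}(\sigma_n - \sigma_{n-1}) \tag{33} \]

for $\Delta S_n = S_n - S_{n-1}$ and $\Delta N_n = N_n - N_{n-1}$. This
approximation gives the SR learning law when $F = \sigma_n$:

\[ \sigma_{n+1} = \sigma_n + \mu_n \left( \frac{\Delta S_n}{S_n} - \frac{\Delta N_n}{N_n} \right) \mathrm{sgn}(\sigma_n - \sigma_{n-1}). \tag{34} \]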


Figure 3a shows sample learning paths of $\sigma_n$ for the
quartic bistable system. The $\sigma_n$ learning paths converge
to the optimum noise values only in some cases.
The simulations confirm that the random gradient
$\partial \mathrm{SNR}_n / \partial \sigma_n$ in (33) is often impulsive and can destabilize
the learning process (34). The impulsiveness of
$\partial \mathrm{SNR}_n / \partial \sigma_n$ suggests that it may have an alpha-stable probability
density function with parameter $\alpha < 2$. A log-tail test
found that $\alpha \approx 1$. This means that $\partial \mathrm{SNR}_n / \partial \sigma_n$ has an
approximate Cauchy distribution.


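The theory of robust statistics [8] suggests one way to
reduce the impulsiveness of $\partial \mathrm{SNR}_n / \partial \sigma_n$. We can replace
the noisy random sample $z_n$ with a Cauchy-like noise
suppressor $\phi(z_n)$ [8]:

(35)

So $\phi(\partial \mathrm{SNR}_n / \partial \sigma_n)$ replaces the approximation of the noise
gradient $\partial \mathrm{SNR}_n / \partial \sigma_n$ in (33). This gives the robust SR
learning law

\[ \sigma_{n+1} = \sigma_n + \mu_n \, \phi\!\left( \frac{\partial \mathrm{SNR}_n}{\partial \sigma_n} \right). \tag{36} \]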

Figure 3b shows the results of the SR learning law
(36) with the gradient in (33). The $\sigma_n$ learning paths
converge to the optimum noise level if the initial value
lies close enough to it, and then $\sigma_n$ wanders in a small
Brownian-like motion about the optimum noise level.
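
The robust update can be sketched in a few lines. This is only an illustration under stated assumptions: the suppressor is taken to be the standard Cauchy score function phi(z) = 2z / (1 + z^2), which is one common choice from robust statistics and is not confirmed by the text above, and the gradient estimate follows our reading of (33). The function names and arguments are hypothetical.

```python
import numpy as np

def cauchy_suppressor(z):
    """Assumed suppressor: phi(z) = 2z / (1 + z^2), bounded in [-1, 1]."""
    z = np.asarray(z, dtype=float)
    return 2.0 * z / (1.0 + z * z)

def sr_gradient_estimate(S, S_prev, N, N_prev, sigma, sigma_prev):
    """Gradient estimate in the spirit of (33):
    (dS/S - dN/N) * sgn(sigma_n - sigma_{n-1})."""
    return ((S - S_prev) / S - (N - N_prev) / N) * np.sign(sigma - sigma_prev)

def robust_sr_step(sigma, sigma_prev, S, S_prev, N, N_prev, mu):
    """One step of a robust learning law in the spirit of (36)."""
    z = sr_gradient_estimate(S, S_prev, N, N_prev, sigma, sigma_prev)
    return sigma + mu * float(cauchy_suppressor(z))
```

Because the assumed suppressor is bounded by 1 in magnitude, a single impulsive gradient sample can move the noise level by at most the learning rate mu. Note that the sign term is zero when two successive noise levels coincide, so a practical implementation would need to handle that case.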

Similar results hold for other noise densities with finite
variance such as Laplace and uniform noise. Figure 4
shows $\sigma_n$ learning paths for the quartic bistable system
(30)-(31) with Laplace noise and uniform noise.

4.2  SR  IN  OTHER  DYNAMICAL 
SYSTEMS 

We  tested  the  bistable  potential  neuron  model  with 
Gaussian  white  noise  [4] 

\[ \dot{x} = -x + 2\tanh x + s(t) + n(t) \tag{37} \]

\[ y(t) = \mathrm{sgn}(x(t)). \tag{38} \]

Figure 5a shows the SR learning paths of $\sigma_n$. The
sinewave input is $s(t) = \varepsilon \sin 2\pi f_0 t$ where $f_0 = 0.01$
Hz and $\varepsilon = 0.1$ and $\varepsilon = 0.3$. The time step in
the discrete simulation is $T = 0.0195$. The sampling
interval is $T_s = 0.975$ or 50 times the time step $T$.
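
A minimal simulation sketch of (37)-(38) follows. It uses the time step T = 0.0195 and the 50-step sampling interval from the text. The Euler-Maruyama discretization, the sqrt(T) scaling of the white-noise increment, the initial state, the random seed, and the default sample count are all assumptions made for illustration.

```python
import numpy as np

def simulate_bistable_neuron(sigma, eps=0.1, f0=0.01, T=0.0195, decimate=50,
                             n_samples=2048, x0=0.0, seed=0):
    """Simulate x' = -x + 2 tanh(x) + s(t) + n(t) and return sampled y = sgn(x)."""
    rng = np.random.default_rng(seed)
    x, t = x0, 0.0
    y = np.empty(n_samples)
    for k in range(n_samples):
        for _ in range(decimate):
            s = eps * np.sin(2.0 * np.pi * f0 * t)
            # Euler-Maruyama step; the sqrt(T) noise scaling is an assumption.
            x += T * (-x + 2.0 * np.tanh(x) + s)
            x += sigma * np.sqrt(T) * rng.standard_normal()
            t += T
        # Binary output; x == 0 is mapped to +1 here for simplicity.
        y[k] = 1.0 if x >= 0.0 else -1.0
    return y
```

A learning loop would call such a simulator once per iteration with the current noise level and then estimate the signal and noise terms from the sampled output.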

We next tested the forced FitzHugh-Nagumo neuron
model [5, 7, 15]

\[ \epsilon \dot{x} = -x(x^2 - \tfrac{1}{4}) - w + A + s(t) + n(t) \tag{39} \]

\[ \dot{w} = x - w \tag{40} \]

\[ y(t) = x(t). \tag{41} \]

The constants are $\epsilon = 0.005$, $a = 0.5$, and $A = -(5/12\sqrt{3} + 0.07)$.
The sinewave input is $s(t) = \varepsilon \sin 2\pi f_0 t$ with
$\varepsilon = 0.01$ and $f_0 = 0.1$ and 0.5 Hz. The
sampling interval is $T_s = 0.01$ with $T = 0.001$. Figure
5b shows the learning paths of the standard deviation
of the Gaussian white noise $n$.

4.3  FUZZY  SR  LEARNING:  THE 
QUARTIC  BISTABLE  SYSTEM 

We used a fuzzy function approximator $F : R^n \to R$
to learn and store the entire surface of optimal
noise values for the quartic bistable system with input
sinewaves. The fuzzy system had as its input the
2-D vector of sinewave amplitude $\varepsilon$ and frequency $f_0$.
We tested the system with the fixed input initial value
$x(0) = -1$. The fuzzy system itself defined a vector
function $F : R^2 \to R$ and used 200 rules. The Cauchy
noise suppressor gives the learning law (21) as

\[ m_j(n+1) = m_j(n) + \mu_n \, \phi\!\left( \frac{\partial \mathrm{SNR}_n}{\partial \sigma_n} \right) \frac{\partial F}{\partial m_j}. \tag{42} \]

Figure 6 shows how we formed a first set of rules on
the product space of the two variables $\varepsilon$ and $f_0$. It
also shows how the learning laws move and shape the
width of the if-part sinc set. Figure 7 shows the results
of SAM learning of the optimal noise pattern for the
quartic bistable system. The sinc SAM used 200 rules.
Fewer rules gave a coarser approximation.
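
The if-part sets can be sketched as follows. The scalar set function sin(x)/x and the 10 x 20 product grid come from the Figure 6 caption. The center-and-width parametrization and the function names are assumptions for illustration only.

```python
import numpy as np

def sinc_set(x, center, width):
    """Sinc set value sin(u)/u with u = (x - center) / width."""
    u = (np.asarray(x, dtype=float) - center) / width
    # np.sinc(v) is the normalized sinc sin(pi v)/(pi v), so rescale by pi.
    return np.sinc(u / np.pi)

def joint_sets(eps, f0, eps_centers, f0_centers, eps_width, f0_width):
    """Product of the 1-D amplitude and frequency sets for one input pair."""
    a_eps = sinc_set(eps, eps_centers, eps_width)  # one value per amplitude set
    a_f0 = sinc_set(f0, f0_centers, f0_width)      # one value per frequency set
    return np.outer(a_eps, a_f0).ravel()           # one value per joint rule
```

With 10 amplitude centers and 20 frequency centers, `joint_sets` returns 200 values, one per rule. The sinc set can take negative values, which is why the caption describes these as generalized fuzzy sets.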

5  CONCLUSIONS 

Stochastic gradient ascent can learn to find the SR
mode of at least some simple dynamical systems. This
learning scheme may fail to scale up for more complex
nonlinear dynamical systems of higher dimension
or may get stuck in the local maxima of multimodal
SNR profiles. Simulations showed that the key learning
term itself can give rise to strong impulsive shocks
in the learning process. These shocks often approached
Cauchy noise in intensity. A Cauchy noise suppressor
gave a working SR learning scheme for the DFT-based
SNR measure. Other SNR measures or other process
statistics may favor other types of robust noise suppressors
or may favor still other techniques to lessen
the impulsiveness.

Figure 6: If-part sinc fuzzy sets. (a) Scalar sinc set
function $a_j(x) = \sin x / x$. Sinc sets are generalized
fuzzy sets with “membership values” in [-.217, 1]. Element
$x$ belongs to set $A_j$ to degree $a_j(x)$: Degree($x \in A_j$) = $a_j(x)$.
(b)-(c) Initial subsets for sinewave amplitudes
and frequencies. There are 10 fuzzy sets
for amplitude $\varepsilon$ and 20 fuzzy sets for frequency $f_0$.
The product of two 1-D sets gives the 2-D joint sets:
$a_j(x) = a_j(\varepsilon, f_0) = a_j^1(\varepsilon) a_j^2(f_0)$. So the product space
gives 10 × 20 = 200 if-part sets in the if-then rules.

Gradient-ascent learning can find the SR mode of the
main known dynamical models that show the SR effect
and can do so in the presence of a wide range of
noise types. This suggests that SR may occur in many
multivariable dynamical systems in science and engineering
and that simple learning schemes can sometimes
measure or approximate this behavior. We lack
formal results that describe when and how such SR
learning algorithms will converge for which types of
SR systems. This reflects the general lack of a formal
taxonomy in this promising new field: Which noisy dynamical
systems show what SR effects for which forcing
signals?

Figure 7: Optimal noise levels in terms of the signal-to-noise ratio for the quartic bistable system (30)-(31).
(a) The optimum noise pattern when inputs are sinewaves with distinct amplitudes and frequencies. (b) SAM
fuzzy approximation of the optimum noise after 30 epochs. The sinc SAM used 200 rules. One epoch used 20
iterations that trained on 200 input amplitudes and frequencies. The initialized SAM gave the output value 0.2
as its first estimate of the optimal noise level.


Abstract

This paper introduces a new algorithm, Q2,
for optimizing the expected output of a multi-input
noisy continuous function. Q2 is designed
to need only a few experiments; it
avoids strong assumptions on the form of the
function, and it is autonomous in that it requires
little problem-specific tweaking.

These capabilities are directly applicable to
industrial processes, and may become increasingly
valuable elsewhere as the machine
learning field expands beyond prediction and
function identification, and into embedded
active learning subsystems in robots, vehicles
and consumer products.

Four existing approaches to this problem (response
surface methods, numerical optimization,
supervised learning, and evolutionary
methods) all have inadequacies when the requirement
of “black box” behavior is combined
with the need for few experiments. Q2
uses instance-based determination of a convex
region of interest for performing experiments.
In conventional instance-based approaches
to learning, a neighborhood was defined
by proximity to a query point. In contrast,
Q2 defines the neighborhood by a new
geometric procedure that captures the size
and shape of the zone of possible optimum
locations. Q2 also optimizes weighted combinations
of outputs, and finds inputs to produce
target outputs.

We compare Q2 with other optimizers of
noisy functions on several problems, including
a simulated noisy process with both
non-linear continuous dynamics and discrete-event
queueing components. Results are encouraging
in terms of both speed and autonomy.


1  ACTIVE  LEARNING  FOR 
OPTIMIZATION 

The apparently humble task of parameter tweaking for
noisy systems is of great importance whether the parameters
being tweaked are for an algorithm, a real
manufacturing process, a simulation, or a scientific experiment.
The purpose of this paper is two-fold. First,
we wish to highlight the potential importance of machine
learning as an as-yet underexploited tool in this
domain. Second, we will introduce Q2, a new algorithm
designed for this domain.

We consider a generalized noisy optimization task in
which a vector $x$ of real-valued inputs produces a scalar
output $y$ that is a noisy function of $x$:

\[ y = g(x) + \text{noise} \tag{1} \]

Given a constrained space of legal inputs, the task is
to find the input vector $x_{opt}$ that maximizes $g$, using
only a small number of experiments.

In both industrial settings and in algorithm-tuning,
this task often demands considerable human intervention
and insight. A factory manager who wants to
optimize a process can:

• Buy a computer and statistics software, and hire a
professional statistician to solve the problem using
insight and experiment design.

• Save money and try to “wing it” by manually tuning
the parameters.

For highly expensive or safety-critical processes, the
first option is always preferable, leaving only the question
of which are the best analysis and experiment
design tools for the statistician to use. This area is
heavily investigated by the academic statistics community.

But there are also many situations in which it is impractical
to enlist human-aided analysis during optimization,
for example if a vehicle engine self-tunes
during driving. And there are many other situations
in which the potential benefit from optimization is too
small to justify paying for expert professional analysis.
In such cases, it is tempting to ask: Can “black box”
automated methods optimize noisy systems? If practical
black box methods are found, they could be widely
used. Somewhat fancifully, this could lead to the eventual
inclusion of Black Box Optimizer chips within a
huge range of consumer products, from vehicle engines
and industrial equipment down to refrigerators, toasters,
and toys.

In the next section we discuss variants of the Black
Box Noisy Optimization task. Then in Section 3 we
discuss existing approaches. After that we present and
evaluate Q2, a new algorithm.


2  VARIANTS  OF  NOISY 
OPTIMIZATION 

The generalized noisy optimization task summarized
by Equation 1 has many variants. For instance, in
some domains each experiment is a lengthy procedure,
and so there is ample computation time between experiments.
In other domains, experiments are very
quick, leaving an optimizer little time to make its recommendations.
The specifics of the domain determine
which methods are appropriate. The following factors
need to be considered:

• Minimize regret or the number of experiments?
Do we pay a constant cost per experiment,
or do experiments with poor results cost
us more? In scenarios such as tuning the parameters
for an algorithm, or optimizing a test plant in
which all products will be discarded, the cost per
experiment may be constant. But in a task such
as minimizing the fuel consumption of a running
engine, some experiments cost more than others.
Here, we focus on simply minimizing the number
of experiments. Note that this presumes that we
are not risk-averse; there is no penalty for performing
highly unpredictable experiments.

• How much computer time is available to
choose experiments? If experiments are very
cheap and very quick, then an algorithm that
needs extensive CPU time to select the ideal next
experiment could still be inferior to one that requires
only a fraction of a second to suggest a
reasonable-but-less-than-ideal experiment. Here,
we assume that experiments are costly enough (in
time or money) that it pays to choose them carefully.
But the Q2 algorithm can be adjusted to
satisfy any desired tradeoff between the speed and
the quality of proposed experiments.

• Are we doing local or global optimization?
Unless we have strong prior knowledge, global optimization
of a function of more than a couple
of inputs requires a very large number of experiments.
Q2 is only designed to find a local optimum,
though empirically it appears to be good at
discovering the global optimum.

• Can we re-use old data? Many algorithms
have a “current location” or “current set of k recent
evaluations” but otherwise disregard earlier
evaluations. Q2, however, can exploit any existing
data, including previous evaluations obtained
by other experimental methods.

In  this  paper  we  also  assume  that  there  are  no  long 
term  dynamics,  i.e.  the  output  of  the  n’th  experiment 
depends  only  on  the  n’th  chosen  x,  not  on  previous  x 
values  or  the  time.  Unlike  [2,  6]  we  only  try  to  find 
the  optimum,  not  to  model  the  g  function. 

3  POSSIBLE  APPROACHES 

Many  disciplines  have  methods  that  are  relevant  to 
noisy  optimization.  Space  permits  only  a  brief  survey. 

Numerical analysis: Numerical methods such as
Newton-Raphson or Levenberg-Marquardt [11] have
fast convergence properties, but they must be applied
carefully to prevent oscillations or divergence to infinity,
which violates our desire for black box autonomy.
Furthermore, current numerical methods cannot survive
noise.

Stochastic approximation: The algorithm of [12]
finds roots without the use of derivative estimates.
Kiefer-Wolfowitz (KW) [5] is a related algorithm for
noisy optimization. It estimates the gradient by performing
experiments in both directions along each dimension
of the input space. Based on the estimate,
it moves its experiment center and repeats. It uses
decreasing step sizes to ensure convergence. KW's
strengths are its aggressive exploration, its simplicity,
and that it comes with convergence guarantees. However,
it can attempt wild experiments if there is noise,
and discards the data it collects after each gradient
estimate is made. Amoeba (see below) is a similar
approach, but in our experience is superior to KW.
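
The KW procedure described above can be sketched as follows. The gain sequences a_n = a/n and c_n = c/n^(1/3) are a common textbook choice and are an assumption here, as are the function name and default constants.

```python
import numpy as np

def kiefer_wolfowitz(noisy_g, x0, n_iter=200, a=0.5, c=0.5):
    """Maximize a noisy function with two-sided finite-difference gradients."""
    x = np.array(x0, dtype=float)
    d = x.size
    for n in range(1, n_iter + 1):
        a_n = a / n                    # decreasing step size (assumed schedule)
        c_n = c / n ** (1.0 / 3.0)     # decreasing probe width (assumed schedule)
        grad = np.empty(d)
        for i in range(d):
            e = np.zeros(d)
            e[i] = c_n
            # Two experiments per dimension, one in each direction.
            grad[i] = (noisy_g(x + e) - noisy_g(x - e)) / (2.0 * c_n)
        x = x + a_n * grad             # move the experiment center uphill
    return x
```

Each iteration costs 2d experiments for d inputs and keeps none of them afterwards, which illustrates the data-discarding weakness noted in the text.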

Amoeba search: Amoeba [11] searches k-d space
using a simplex (i.e. a k-dimensional tetrahedron).
The function is evaluated at each vertex. The worst-performing
vertex is reflected through the hyperplane
defined by the remaining vertices to produce a new
simplex that has moved up the estimated gradient. Ingenious
simplex transformations let the simplex shrink
near the optimum, grow in large linear zones, and ooze
along ridges.

Experiment design response surface methods:
Current RSM practice is described in the classic
reference [1]. It proceeds by cautious steepest ascent
hill-climbing. A region of interest (ROI) is established
at a starting point and experiments are made
at positions that can best be used to identify local
function properties with low-order polynomial regression.
Much of the RSM literature concerns experimental
design: deciding where to take data in order
to acquire the lowest variance estimate of the polynomial
coefficients in a fixed number of experiments.
When the gradient is estimated confidently, the ROI
is moved accordingly. Quadratic regression locates optima
within the ROI, and diagnoses ridge systems and
saddle points. The strength of RSM is that it avoids
changing operating conditions based on inadequate evidence,
but moves once the data justifies it. A weakness
of RSM is that human judgment is needed: it is
not an algorithm, but a manufacturing methodology.

Evolutionary computation and learning automata:
Methods such as genetic algorithms begin by
sampling uniformly, but then bias later samples in favor
of the experiments that had good outcomes. There
is a vast literature of refinements of such methods.
These approaches need thousands, sometimes millions,
of evaluations, because they attack a different problem:
Global Optimization, usually for noise-free, cheap-to-evaluate
criteria.

PMAX: PMAX is a simple, effective algorithm.
Based on the data from the experiments so far, it uses
a non-linear function approximator to estimate the underlying
function $g(x)$. The next experiment is taken
at the point that maximizes the estimate of $g$. This approach
has been used with a decision-tree approximator
[13], with neural nets (in many commercial products),
and with locally weighted regression [9]. Variations
of PMAX include taking the next experiment
not at the predicted optimum, but instead where the
confidence intervals are widest [6], or where the top
of the confidence interval is maximized [9], or in accordance
with the Interval Estimation heuristic [4] or
similar criteria [13].

Empirically, we have found that PMAX using locally
weighted regression as the function approximator is
often faster than more sophisticated alternatives [9].
However, it has some serious drawbacks:

• In conventional function approximation one must
solve the bias-variance tradeoff. This is often determined
automatically using cross-validation [8],
but this proves difficult with a set of very few,
weirdly distributed datapoints obtained during
optimization. Empirically we have observed dismal
performance when attempting this. In addition,
conventional approaches search for the best
model over the whole data range, whereas we only
need our model to be accurate in the vicinity of
the optimum.

• PMAX is very expensive. It needs to train a
function approximator each time an experiment is
made, and then the approximate function must be
numerically optimized to produce the suggested
experiment.

• PMAX can get stuck in hallucinated optima since
it is not choosing experiments to give the most
information (in the way that RSM does).

4  THE  Q2  ALGORITHM 

The Q2 algorithm is an attempt to combine the
strengths of Newton's method (superlinear convergence),
RSM (using estimates of significance in the face
of noise), and PMAX (exploiting all available data).
Let us first outline the structure of the Q2 algorithm,
before discussing its details:

1. Input a set of previous experimental results

\[ (x_1 \to y_1), (x_2 \to y_2), \ldots, (x_n \to y_n) \tag{2} \]

and HR: a hyper-rectangular portion of input
space over which the optimization is constrained
to take place.

2. Select a convex Region Of Interest (ROI) within
HR such that:

• The constrained optimum within HR is expected
to lie within ROI.

• There is no evidence to contradict the
assumption that the function is well-approximated
by a quadratic within ROI.

3. Select a useful experiment to take within ROI.

4. Return the experiment, the estimated location of
the optimum, and (optionally) other information
such as the ROI and a regression analysis of the
local quadratic.

In typical operation, the suggested experiment is
performed, the new datapoint is added to the
dataset, and Q2 returns to Step 2.

Step  2:  Selecting  the  ROI 

Step 2 begins by generating a sequence of candidate
Regions Of Interest, $ROI_1, ROI_2, \ldots, ROI_j, \ldots$ from
which the final ROI will be selected. The generated
sequence has the properties that

\[ ROI_1 = HR \quad \text{and} \quad ROI_j \supset ROI_{j+1} \tag{3} \]

where $ROI_{j+1}$ is determined by cutting away an unpromising
subregion of $ROI_j$. How is the cut determined?
Let us consider an example.

Figure 1 shows a Gaussian function of two inputs. Suppose
HR is set to be the full square region depicted
in the figure, and suppose we have available the thirty
noisy datapoints that are also shown. Call this dataset
$DS_1$. We can fit a quadratic to $DS_1$. Write

\[ \hat{y}_k = c + b^T x_k + \tfrac{1}{2} x_k^T A x_k \tag{4} \]

where $A$ is symmetric, or, equivalently,

\[ \hat{y}_k = c + b_1 x_{k1} + b_2 x_{k2} + \tfrac{1}{2} a_{11} x_{k1}^2 + a_{12} x_{k1} x_{k2} + \tfrac{1}{2} a_{22} x_{k2}^2. \tag{5} \]

The regression is a matter of simple matrix manipulation.
Write $z_k$ = the vector of polynomial terms for
the $k$th input point, $x_k$:

\[ z_k = (1, x_{k1}, x_{k2}, x_{k1}^2, x_{k1} x_{k2}, x_{k2}^2) \tag{6} \]

Write $Z$ = a matrix whose $k$th row is $z_k$, and write
$Y$ = a vector whose $k$th element is $y_k$. Finally define

\[ \beta = (c, b_1, b_2, \tfrac{1}{2} a_{11}, a_{12}, \tfrac{1}{2} a_{22})^T \tag{7} \]

as the regressed coefficients. Then using Bayesian regression
with non-informative priors on $\beta$ and on the
(estimated) Gaussian noise, we have the MAP estimate of $\beta$ (also
the maximum likelihood value in this case) as

\[ \beta = (Z^T Z)^{-1} Z^T Y \tag{8} \]

In practice, if the information is known, we can put
Gaussian priors on the coefficients and an inverse-Gamma
prior on the noise. For our dataset the resulting
quadratic approximation is shown in Figure 2.
Note that because the underlying function is so far
from quadratic, this is a poor fit.
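
The regression in (4)-(8) can be sketched for the two-input case as below. This is an illustration rather than the authors' code: it solves the least-squares problem with `numpy.linalg.lstsq`, which gives the same solution as (8) when Z has full column rank, and the function names are hypothetical.

```python
import numpy as np

def quadratic_features(X):
    """Rows z_k = (1, x1, x2, x1^2, x1*x2, x2^2) as in (6)."""
    X = np.asarray(X, dtype=float)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])

def fit_quadratic(X, y):
    """Return (c, b, A) of the fitted quadratic c + b.x + 0.5 x.A.x."""
    Z = quadratic_features(X)
    beta, *_ = np.linalg.lstsq(Z, np.asarray(y, dtype=float), rcond=None)
    c = beta[0]
    b = beta[1:3]
    # beta holds a11/2, a12, a22/2 as in (7), so undo the factors of 1/2.
    A = np.array([[2.0 * beta[3], beta[4]],
                  [beta[4], 2.0 * beta[5]]])
    return c, b, A
```

At least six datapoints in general position are needed for the six coefficients to be determined.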

Q2 evaluates each of the datapoints in $DS_1$ using the
quadratic, producing the $\hat{y}_k$ values of Equation 5. Let
$(x_{k(1)}, y_{k(1)})$ be the datapoint that is predicted to be
the worst, i.e. $k(1) = \arg\min_k \hat{y}_k$. It will be used to
define a cut of $ROI_1$. We look at the direction of the
steepest gradient, $\nabla \hat{y}$, of the quadratic at $x_{k(1)}$, and we
cut using the half-plane perpendicular to this direction
so that

\[ ROI_2 = ROI_1 \cap \{ x \mid (x - x_{k(1)}) \cdot d_1 > 0 \} \tag{9} \]

where $d_1 = \nabla \hat{y}$ evaluated at $x_{k(1)}$.

In Figure 2, the worst point according to the quadratic
is at the top left, and with some effort the resulting
cut-plane can be seen.

Why do we use the above approach? We want to use
our unreliable (probably biased) quadratic to tell us
how to reduce the ROI. We assume that even if the
quadratic is a poor model for $y$, it will be adequate
to predict an unpromising location for the optimum.
Why pick the point with the predicted worst value instead
of the actual worst value? Because the actual
values are noisy, meaning that an unlucky datapoint
could be misleadingly removed.


We have described how $ROI_2$ is constructed from
$ROI_1$. In general, $ROI_{j+1}$ is constructed from $ROI_j$
using a similar recipe: set $DS_{j+1} = DS_j - \{(x_{k(j)}, y_{k(j)})\}$,
do a regression using dataset $DS_{j+1}$ (which will be
less biased than using $DS_j$), and cut using the point
that the new regression predicts will be worst. Figure
3 shows the approximation that results after the
first cut has been made (giving a less biased fit than
Figure 2), and also shows the second cut. Figure 4
shows what remains after the twelfth cut: the fit is
now good, because it is only based on datapoints near
the quadratic-shaped optimum. Figures 5-7 use a bigger
dataset and an extreme ridge system.
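
The cutting recipe can be sketched as follows for two inputs. This is a simplified illustration under assumptions: each cut removes only the predicted-worst datapoint, as the recipe states, the fit is plain least squares, and all function names are hypothetical.

```python
import numpy as np

def fit_quadratic(X, y):
    """Least-squares quadratic fit in two inputs; returns (c, b, A)."""
    x1, x2 = X[:, 0], X[:, 1]
    Z = np.column_stack([np.ones(len(X)), x1, x2, x1**2, x1 * x2, x2**2])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    A = np.array([[2.0 * beta[3], beta[4]], [beta[4], 2.0 * beta[5]]])
    return beta[0], beta[1:3], A

def predict(X, c, b, A):
    """Quadratic prediction c + b.x + 0.5 x.A.x for each row of X."""
    return c + X @ b + 0.5 * np.einsum("ki,ij,kj->k", X, A, X)

def cut_sequence(X, y, n_cuts):
    """Return the list of cuts (point, direction) and a mask of kept datapoints."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(X), dtype=bool)
    cuts = []
    for _ in range(n_cuts):
        idx = np.flatnonzero(keep)
        if len(idx) < 6:               # too few points to fit six coefficients
            break
        c, b, A = fit_quadratic(X[idx], y[idx])
        worst = idx[np.argmin(predict(X[idx], c, b, A))]
        d = b + A @ X[worst]           # gradient of the quadratic at the worst point
        cuts.append((X[worst].copy(), d))
        keep[worst] = False            # drop the point before the next regression
    return cuts, keep

def in_roi(x, cuts):
    """True if x lies on the uphill side of every cut."""
    return all(np.dot(x - p, d) > 0 for p, d in cuts)
```

The first j cuts, intersected with HR, describe the j-th candidate region. Choosing among the candidates requires the Bayesian criterion described in the text, which this sketch does not implement.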

At this point Q2 has generated a series of candidate
regions, $ROI_1, ROI_2, \ldots$ To decide which to select, we
perform regression analysis on the quadratics in each
of the ROIs. As $j$ increases, $ROI_j$ shrinks and is based
on fewer datapoints. So, as $j$ increases, $ROI_j$'s bias decreases
and its variance increases. We select the $ROI_j$
with the best tradeoff using the criterion: Choose the
smallest ROI for which Bayesian regression analysis
is confident about the location of the optimum, and for
which the optimum is, with high probability, inside the
ROI.¹

The results of this criterion are shown in Figures 8-13.
With fewer or noisier datapoints, larger ROIs are
chosen. The shape of the chosen ROIs nicely reflects
the shape of the local ridge system (Figure 7). If irrelevant
inputs are included, the ROI chosen by Q2
tends to stretch to ignore irrelevant dimensions (pictures
omitted because of space constraints).

Step  3:  Choosing  the  experiment 

Once the ROI is determined, the estimated optimum
is easily obtained as

\[ x_{opt} = -A^{-1} b \tag{10} \]

(assuming the quadratic fit has revealed a maximum,
meaning $A$ is negative-definite). $x_{opt}$ is not necessarily
the best place to experiment in order to gain useful
new information. Instead, we investigated these options:

1. Put experiment at $x_{opt}$.

2. Choose a random point within ROI.

3. Choose the point in ROI that is predicted to most
reduce the uncertainty about the location of the
optimum.

4. Choose the point in ROI that keeps the regression
as orthogonal [1] as possible, mimicking established
RSM practice.

5. Choose the point in ROI as far away from any
previous datapoints (in or out of ROI) as possible.

¹ This is achieved by taking the joint posterior distribution
(normal-gamma) on the noise and the coefficients
of the quadratic form, and then (via Monte Carlo sampling)
seeing whether at least t = 98% of the samples lie
in the ROI and whether the expected regret of committing
to the optimum is below a threshold (2% of the range of
output values). Empirically these threshold choices are not
performance-critical.

Figure 1: A function of two inputs. The optimum is at (0.75,0.25).
It is a Gaussian bump, and hence very flat more than about 0.4 units
of distance from the optimum. Also shown are 30 noisy datapoints.
These were generated with uniformly random (x,y) coordinates, with
z (height) set to f(x,y) plus Gaussian noise with standard deviation
0.1.

Option 5 is best empirically. This is because options
3 and 4, despite their elegance, usually choose experiments
at the edge of the ROI, reducing the opportunity
for future cuts to shrink future ROIs. Option 1
quickly becomes stuck, and option 2 frequently wastes
experiments.
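
Option 5 can be sketched as a max-min distance search. The text does not say how the farthest point is found, so the random candidate search below is an assumption, as are the function name, the `in_roi` callback, and the default candidate count.

```python
import numpy as np

def farthest_point_experiment(X_prev, in_roi, lower, upper,
                              n_candidates=2000, seed=0):
    """Pick the ROI candidate whose nearest previous datapoint is farthest away."""
    X_prev = np.asarray(X_prev, dtype=float)
    rng = np.random.default_rng(seed)
    # Sample candidates uniformly in the hyper-rectangle HR = [lower, upper].
    cand = rng.uniform(lower, upper, size=(n_candidates, len(lower)))
    mask = np.array([bool(in_roi(c)) for c in cand])
    cand = cand[mask]
    if len(cand) == 0:
        raise ValueError("no candidate fell inside the ROI")
    # Distance from every candidate to every previous datapoint.
    dists = np.linalg.norm(cand[:, None, :] - X_prev[None, :, :], axis=2)
    return cand[np.argmax(dists.min(axis=1))]
```

All previous datapoints are passed in, including those outside the ROI, to match the wording of option 5.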

Details 

In this short paper, many details have been omitted.
Some regressions predict a minimum or a saddlepoint,
instead of a maximum. We have special-purpose techniques
to deal with this. The Bayesian analysis is
largely standard, and also omitted: see [3] for more details.
Some confidence measures require Monte Carlo
integration. These details will be discussed in a forthcoming
technical report [10].

5  RESULTS 

We begin by comparing Q2 with four versions of
Amoeba and three versions of PMAX on the function
$f_1$ from Figure 1 with noise of 0.3 added to each
evaluation.² Amoeba is the classic search algorithm
from [11]. Amoeba2 is the same except it is made resistant
to noise by doing two evaluations and taking
their average at each simplex vertex. Amoeba4 and
Amoeba8 similarly average four and eight evaluations
at each vertex. All the Amoebas begin with a medium-sized
simplex started randomly in input space.

² These tasks are available from
http://www.cs.cmu.edu/~AUTON

Figure 2: The best-fitting global quadratic regression approximation
obtained by least squares regression on the 30 datapoints. The worst-scoring
datapoint is in the top left.

The results are in Figure 14. In this (and all subsequent
experiments) we performed 25 independent runs
of each optimizer, with each run consisting of 60 experiments.
As well as selecting the datapoints for the
experiments, at every stage the optimizers also gave
their estimate of the location of the optimum. To assess
the various optimizers, we wish to compare how
good they are at estimating the optimum, and so we
look at the true value of the underlying function at
these estimates of the optimum. For the $i$th run of a
particular optimizer, let $s_i$ denote the mean of the true
values at the estimates of the optimum. The figures in
the left hand column are the mean $s_i$ value of the optimizer
over all 25 runs (i.e. $(\sum_i s_i)/25$). These values
are also drawn graphically in the same column: the
further to the right the dot lies, the better the mean
score. The horizontal lines depict the 95% confidence
intervals on the mean. The right hand column shows
the mean performance of the optimizer on the final 15
of the 60 experiments. Unsurprisingly, all methods do
better in later experiments, so the right hand means
are higher.
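
The summary statistic can be sketched as below. The text does not state how the 95% confidence intervals were computed, so the normal approximation (z = 1.96) is an assumption, and the function name is hypothetical.

```python
import numpy as np

def mean_with_ci(scores, z=1.96):
    """Mean of per-run scores s_i with an approximate 95% confidence interval."""
    s = np.asarray(scores, dtype=float)
    mean = s.mean()
    # Standard error of the mean from the sample standard deviation.
    half_width = z * s.std(ddof=1) / np.sqrt(len(s))
    return mean, mean - half_width, mean + half_width
```

For 25 runs, `scores` would hold the 25 values of s_i for one optimizer. A Student-t interval would be slightly wider for a sample of that size.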

Figure 14 shows that Q2 outperforms all the other
methods on this problem. Amoeba4 is the best of the
Amoebas; it is less affected by noise than Amoeba and
Amoeba2, but it makes better progress than Amoeba8,
which wastes 8 evaluations on every vertex.
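The trade-off between the Amoeba variants can be illustrated with a small sketch. Averaging k independent noisy evaluations shrinks the noise standard deviation by roughly a factor of sqrt(k), but uses k evaluations per vertex. The test function, noise level, and all names below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import random
import statistics

def noisy_f(x, rng, sigma=0.3):
    """Hypothetical 1-d test function: a quadratic plus Gaussian noise."""
    return -(x - 0.5) ** 2 + rng.gauss(0.0, sigma)

def averaged_eval(f, x, k, rng):
    """Amoeba-k style evaluation: average k noisy evaluations at x."""
    return sum(f(x, rng) for _ in range(k)) / k

def evaluation_spread(f, x, k, repeats, rng):
    """Empirical standard deviation of the k-sample average at x."""
    samples = [averaged_eval(f, x, k, rng) for _ in range(repeats)]
    return statistics.stdev(samples)
```

If every evaluation counts against a fixed experiment budget, a larger k buys less noise per vertex at the price of fewer simplex moves, which is consistent with an intermediate k doing best on a noisy problem.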


Figure 3: After the worst-scoring point is removed from the regression,
we have the following fit to the remaining 29 datapoints. The worst
predicted point among these is halfway up along the left edge. Note
the cut that it causes.

Figure 4: After 12 cuts the remaining datapoints (those inside the
convex region defined by the cuts) are relatively close to the optimum,
and the resulting local quadratic regression is an excellent local
approximation.


Figure 5: Another function of two inputs. The optimum is on the
banana-shaped ridge at (0.75,0.2). 200 datapoints are shown (their
heights omitted).

Figure 6: After the first 150 cuts, the region of interest nicely
surrounds the ridge.

Figure 7: After the first 180 cuts, the region of interest is smaller
still, yet continues to surround the true optimum.

Table 1 (shown later) gives results for the 2d-functions
of Figures 5, 15, and 16 for noise levels of 0 and 0.3.
With no noise, the one-evaluation-per-step version is
always the best Amoeba. With noise, the best Amoeba
is problem specific. The best PMAX is also problem
specific. Q2 adapts well to noise and to differing levels
of function complexity. Q2 is beaten by the Global and
mediumly local PMAX for the noisy pure quadratic
$f_3(x_1, x_2)$. In all other cases Q2 wins, but its main
strength is autonomy: unlike Amoeba and PMAX no
problem specific parameter needs to be chosen to make
Q2 perform well.

Figure 17 shows a simulated, sanitized version of a real
industrial process. Liquids enter a tank at a certain
rate (a parameter) and a certain mix-ratio (a parameter)
unless the tank is above a certain level (a parameter).
They react causing a color dependent on the tank
mix-ratio and the time spent in the tank. Thickener is
added at a certain rate (a parameter), and the output
passes through a cooling tunnel to wait on a holding
belt. While waiting, color may change. When the belt
fills beyond a certain level (a parameter), production
halts. Customer demand randomly consumes material
on the holding belt. The yield is the amount of material
that reaches the customer with color lying in an
acceptable tolerance range. This is a very noisy task.
The yield is a highly non-quadratic function; one input
is almost irrelevant, the others are all important,
and two of the inputs must run to their maximum legal
value for best performance. The results are given
in Figure 18, and show a significant win for Q2. Q2
and the PMAXs also have far more repeatable results
than the Amoebas.

We also applied conventional RSM to this task, using
a star design prescribed by [1]. The star occupied the
hyperrectangle defined by the legal ranges of values for


392  Moore,  Schneider,  Boyan,  and  Lee 


Figure  8:  The  region  of  interest  selected  for 
the  function  of  Figure  1  given  a  dataset  of  only 
10  points. 


Figure  11:  The  region  of  interest  selected  for 
the  function  of  Figure  1  given  a  dataset  of  30 
points,  with  no  noise. 


Figure  9:  The  region  of  interest  when  given 
30  datapoints. 


Figure 12: The region of interest when noise
with std. dev. σ = 0.5 is added to the
observations.


Figure  10:  The  region  of  interest  when  given 
50  datapoints. 


2.0. 


each  input.  It  needed  76  evaluations,  but  the  chosen 
optimum  had  a  yield  below  10  units;  worse  than  all 
the  other  methods,  indicating  that  the  assumption  of 
a  global  quadratic  is  inadequate  in  this  domain. 

Next, we examine a domain where experiments are
time-consuming. Figure 19 shows a generalization of
the multi-buffer machine task described in [7] (this
makes 10 products instead of 5). There are two inputs
defining a simple parameterized policy for when
to service the machine. Services are costly, but
unscheduled breakdown is much worse. This task is
evaluated by a computationally expensive simulation; for
each setting of the two inputs, we perform 10000
simulation steps to evaluate the performance. Evaluations
are very stochastic (with highly non-Gaussian noise).
The results are shown for runs of only 24 experiments.
Q2 learns a good policy in these 24 experiments, i.e.
a total of only 24 × 10000 simulation steps. This
compares favorably with the tens of millions of simulation
steps needed for reinforcement learning in [7], but Q2
is unlikely to find as good a policy as their semi-MDP
formulation.

The final results show Q2 being used for root-finding
instead of optimization. The hand position in Figure 21
is a noisy function of $\theta_1$ and $\theta_2$. The task
requires us to achieve the goal hand position. Although
space permits no details, the version of Q2 for root (or
target) finding uses linear instead of quadratic regression
in its ROIs. The results are shown in Figure 22.
Figure 23 shows the results when, on each experiment,
the target position is varied randomly within the
workspace. Amoeba, a pure optimization method for
a fixed goal, is no longer applicable here, but PMAX
and Q2 can still be used because their decision making
simply requires a dataset of previous experiences. Q2's
ability to tune its regions of interest decisively beats
all PMAXs.


              Mean over all 100 trials   Mean over last 25 trials
PmaxGlobal            -0.417                    -0.368
PmaxLocal             -0.402                    -0.342
PmaxVLocal            -0.475                    -0.418
Q2 (Linear)           -0.042                    -0.021

Figure 23: Performance on kinematics when the target
varies during each experiment.

              Mean over all 60 trials    Mean over last 15 trials
Amoeba                 0.999                     1.040
Amoeba2                1.130                     1.234
Amoeba4                1.181                     1.445
Amoeba8                0.950                     1.209
PmaxGlobal             1.667                     1.812
PmaxLocal              1.681                     1.846
PmaxVLocal             1.517                     1.691
Q2                     1.716                     1.894

Figure 14: Performance on $f_1(x_1, x_2)$ from Figure 1.


Figure 15: $f_3(x_1, x_2)$: a simple (pure
quadratic) two-input function with an
optimum at (0.5, 0.5).


Figure 16: $f_4(x_1, x_2)$: a function in
which the only relevant direction is x+y.
The optima lie along a diagonal ridge.


[Table 1 body: for each problem (including $f_3(x_1, x_2)$ and $f_4(x_1, x_2)$) and each noise level (0.0 and 0.3), the columns give the mean over all 60 trials and the mean over the last 15 trials for each optimizer.]

Table 1: Optimization results for seven optimizers on three problems at two noise levels.


6  CONCLUSION 

This paper has highlighted the importance of Black
Box Noisy Optimization, surveyed possible approaches,
and then introduced a new algorithm: Q2.

Algorithms like Newton's method, golden ratio search
and conjugate gradient [11] maintain a region expected
to contain an optimum and in which future experiments
will occur. Q2 tries to do the same thing with
two innovations. First, it can derive a ROI from a
previous dataset irrespective of how that dataset was
collected. Second, Q2 can survive noise. Q2 is also
related to RSM and traditional instance-based learning.
Future Q2 work will include trials on real processes,
batching experiments, semi-quadratic regression
for high dimensions, and survival of slowly
time-varying systems.

Future work: This algorithm only finds local optima:
what can be done to encourage further exploration for
alternative optima? We also hope to produce a formal
characterization of when this approach will best
work. The main limitation is that the computational
cost grows rapidly with the number of inputs, and the
current Q2 is unlikely to be useful above 10 inputs.
We have begun investigation into versions applicable
to hundreds of inputs.

[Figure 19 diagram labels: Raw materials; Unreliable Multipurpose Machine; Finite capacity buffers; Products.]

Figure 19: A multi-buffer servicing task similar to those described
in [7].


Noise  (+/•) 


              Mean over all 60 trials    Mean over last 15 trials
Amoeba                25.297                    27.130
Amoeba2               27.412                    33.216
Amoeba4               22.302                    26.534
Amoeba8               19.858                    21.411
PmaxGlobal            28.006                    37.237
PmaxLocal             27.634                    37.952
PmaxVLocal            23.012                    27.105
Q2                    36.334                    45.589

Figure 18: Performance on the simulated production process.

Figure 20: Performance at the multi-buffer task.

Figure 22: Performance on the kinematics task.
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Abstract 

We apply various generalizations of weighted
majority prediction algorithms for on-line
prediction of binary relations to the problem
of predicting personal preferences over
information contents, which is a key issue in
collaborative filtering. Note that the collaborative
filtering problem can be cast as learning
a binary relation between the users (as
the rows) and the contents (as the columns).
The original prediction algorithm of Goldman
and Warmuth [GW95] makes its prediction
by majority voting by the rows with
observed data in the same column, weighted
by the believed similarity between the rows.
In the present paper, we propose a generalization
'G-Learn-Relation' of their algorithm
to the multi-valued setting, and empirically
demonstrate that it performs better than
existing filtering methods based on correlation
coefficients, both on simulated and real data.
The performance comparison was done in
terms of the total number of prediction mistakes
and the measures of precision and recall.
Additionally, we propose a version of
G-Learn-Relation that makes use of indirect
evidence available as believed similarity
between other rows, and another version in
which both row similarity and column
similarity are used for prediction. In both cases,
significant improvement was observed in
experiments involving simulated data. Finally,
we give a theoretical performance guarantee
for G-Learn-Relation in terms of an upper
bound on the worst case number of mistakes,
which together with a lower bound on the
number of mistakes made by a correlation-based
method establishes that its worst case
performance is better than that of the
correlation-based methods.


1  Introduction 

We apply various generalizations of weighted majority
prediction algorithms, proposed in the context of on-line
prediction of binary relations, to the problem of
predicting users' preferences on information contents.
This is a key issue in personalized information filtering,
an area that is gaining increasing attention in
internet related technology. Information filtering
techniques known in the literature can, for the most part,
be classified into two types. One is the contents-based
approach to filtering, which is based on the features
of the actual contents such as word counts, and the
other is the so-called collaborative (or social) filtering
approach, which makes use of similarities between
the users observed in the past scoring data representing
their preferences. Methods combining the two
approaches have also been proposed. In this paper, we
are concerned with the latter approach, namely filtering
methods that are based solely on the scores given
by the users on the contents.

Existing methods of collaborative filtering [RISBR94,
SM95] make use of correlation coefficients. In this
approach, the preference of a user on a particular content
is predicted by taking a weighted average of all
scores given to that content by various users in the
past, weighted by the correlation coefficients between
their scores and those of the user in question,
calculated using scores given to common contents.¹ These
methods are based on a reasonable intuition that
correlation coefficients can quantify the similarity between
the users' preferences, but one shortcoming of this
approach is that the estimation confidence of the
correlation coefficients is not taken into account.

As a way to address this issue, we resort to on-line
prediction algorithms for binary relations proposed and
studied in the areas of computational learning theory
and machine learning [GRS93, GW95, NA95]. Note
that information filtering can in principle be viewed
as a learning problem for a binary relation, in which
a user is related to a content just in case he or she
prefers it. Such a binary relation can be represented
by a 0,1-valued matrix, in which the rows represent
users and the columns represent contents. In particular,
we make use of the weighted majority prediction
algorithm proposed and analyzed by Goldman
and Warmuth and its generalizations [ALN95]. These
methods learn weights that roughly represent the
believed similarities between the rows (and columns) and
make predictions by weighted majority voting. Here
we further extend these algorithms so as to handle the
cases in which the scores are not necessarily binary but
many-valued.

¹It has been reported that a variant of this method that
uses a threshold and a fixed average does the best among
various methods based on the correlation coefficients [SM95].

First, we generalized the original weighted majority
prediction algorithm 'Learn-Relation' [GW95] into the
many-valued setting. (We call the generalized algorithm
'G-Learn-Relation.') We evaluated the performance
of this method using both simulated and real
data. In our evaluation, we considered a prediction
whose round-off integral value is at most one off
the correct value to be correct and all others to be
mistakes. The experimental results indicate that
G-Learn-Relation out-performs the best known method
based on correlation coefficients in experiments using
both simulated and real data, in terms of the total
number of mistakes. With respect to the more widely used
measures of precision and recall, G-Learn-Relation had
a better overall performance as well. Furthermore,
it was found that G-Learn-Relation is less sensitive
to the choice of its parameters, as compared to the
correlation-based methods.

Next, we evaluated the effect of using the similarities
between the columns as well as the rows in making
predictions. It has been verified, using several
two-dimensional extensions of Learn-Relation, that such
an approach can improve the predictive performance
in another application domain [ALN95]. In our
experiments using simulated data, the effect of using both
rows and columns was observed for both correlation-based
methods and for G-Learn-Relation, the
two-dimensional extension of G-Learn-Relation being the
most favored. With respect to real data, however, the
effect was minimal. This may be attributable to the
fact that the real data used in our experiments had
very uneven numbers of rows and columns (48 rows
and 277 columns).

As an attempt to further improve the performance of
G-Learn-Relation, we enhanced its prediction by using
indirect evidence. In particular, we incorporate
an idea suggested by Lang and Baum [LR97] into
the weighted majority prediction algorithm. Their
method, which they call 'triple row,' is based on the
idea that 'a friend's friend is a friend, too' (and a
friend's enemy is an enemy, too). That is, in determining
the similarity between two rows, we take into
account the (dis)similarity between the two rows and
a third row. Our experimental results indicate that
this enhancement results in a significant performance
improvement on simulated data, but on real data the
effect was inconclusive.

Finally, we give a theoretical performance guarantee
for G-Learn-Relation in terms of an upper bound on
the worst case number of mistakes it makes. We
also show a lower bound on the worst case number
of mistakes made by the correlation-based method
and establish that the worst case performance of the
weighted majority type algorithms is better than that
of the correlation-based methods.

2  The  problem  formulation 

Collaborative filtering using methods that are based
solely on the scores given by the users on the contents
can be viewed as an on-line prediction problem
for binary relations (and multi-valued functions). The
target binary relation (or function) can be represented
by a matrix $M$, whose $(i,j)$-entry represents the score
given by user $i$ on content $j$. On-line learning proceeds
as follows. At any given time $t$, the learning algorithm
is given an arbitrary pair $i,j$ and predicts its
value as $\hat{M}_{ij}$, based on an observation matrix $O^t$. Here
an observation matrix $O$ in general satisfies $O_{ij} = M_{ij}$
whenever the $i,j$ entry has been observed, and $O_{ij} = *$
otherwise. The learner is then given the actual value of
$M_{ij}$, and $O^t$ is updated (to $O^{t+1}$) accordingly. Starting
initially with an observation matrix whose elements are
all $*$, the above process is repeated until the matrix is
fully observed, namely until $O^t = M$. In our experiments,
we assume that the scores are integers between 1 and
5 (5 being the highest score) and prediction is done
with a real number. A prediction is considered correct
if its round-off value² is at most 1 different from the
correct value. The performance of an on-line learning
algorithm is measured in terms of the total number
of mistakes in the entire trial sequence, often as
a function of various parameters quantifying the size
of the problem. These include the numbers of rows
and columns as well as the numbers of row types and
column types, where two rows $i$, $i'$ are said to belong
to the same type if they agree in all columns, namely
if $M_{ij} = M_{i'j}$ holds for all $j$. (The column types are
similarly defined.)
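The protocol above can be sketched as a short simulation loop. This is an illustrative sketch under stated assumptions: the names `run_online`, `is_correct`, and `UNOBSERVED` are hypothetical, the entries are presented in a random order (the text only says the pair is arbitrary), and `predict` stands for any prediction rule that reads the current observation matrix.

```python
import math
import random

UNOBSERVED = None  # plays the role of '*' in the observation matrix

def is_correct(prediction, true_value):
    """A real-valued prediction is correct when its round-off value
    (3.4 -> 3, 3.5 -> 4) is at most 1 away from the true score."""
    rounded = math.floor(prediction + 0.5)
    return abs(rounded - true_value) <= 1

def run_online(M, predict, rng):
    """Present every entry of M once and count prediction mistakes."""
    n_rows, n_cols = len(M), len(M[0])
    O = [[UNOBSERVED] * n_cols for _ in range(n_rows)]
    order = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    rng.shuffle(order)  # assumption: random presentation order
    mistakes = 0
    for i, j in order:
        guess = predict(O, i, j)      # predict before seeing M[i][j]
        if not is_correct(guess, M[i][j]):
            mistakes += 1
        O[i][j] = M[i][j]             # reveal the entry
    return mistakes
```

For instance, `run_online(M, lambda O, i, j: 3.0, random.Random(0))` counts the mistakes of a rule that always predicts the middle score.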

3  Algorithms  Employed 

In this section, we describe the generalized weighted
majority algorithms we propose in this paper,

²For example, the round-off values of 3.4 and 3.5 are 3
and 4, respectively.

The original weighted majority prediction algorithm
(Learn-Relation) makes its prediction $\hat{M}_{ij}$ by weighted
majority voting by all rows $i'$ such that the entry $M_{i'j}$
in the same column has been observed, each weighted
by the weight $w_{ii'}$ representing the believed degree of
similarity between the rows $i$ and $i'$. The weights are
updated by multiplying those contributing to the correct
value by $(2-\gamma)$ and those contributing to a wrong
value by $\gamma$, for some $\gamma < 1$. Note that this update is
equivalent to defining the weight $w_{ii'}$ at each trial as
$(2-\gamma)^{C_{ii'}} \gamma^{W_{ii'}}$, where $C_{ii'}$ is the number of times $i'$
has voted for a correct value in row $i$ and $W_{ii'}$ the
number of times it voted for a wrong value.

G-Learn-Relation generalizes Learn-Relation for
multi-valued functions by letting each row $i'$ vote for
all values (in $V(a)$) within a permitted tolerance from
its predicted value $a = M_{i'j}$. Its weights are updated
in the same manner as in Learn-Relation.

G-Learn-Relation($0 < \gamma < 1$)

With each row pair $(i, i')$ is associated a weight, $w_{ii'}$.
We let $A$ denote the range of entries of $M$, and for any
$a \in A$, $V(a)$ denotes the set of prediction values that
are considered correct when the true value is $a$.

Initialization: $w_{ii'} := 1$

Prediction:

$$\hat{M}_{ij} := \begin{cases} \arg\max_{a \in A} \sum_{i' : O_{i'j} \in V(a)} w_{ii'} & \text{if } \{i' : O_{i'j} \neq *\} \neq \emptyset \\ C_0 \text{ (a constant)} & \text{otherwise} \end{cases}$$

Weight update: For all $i'$ ($\neq i$) such that $O_{i'j} \neq *$,

$$w_{ii'} := \begin{cases} (2-\gamma)\, w_{ii'} & \text{if } O_{i'j} \in V(M_{ij}) \\ \gamma\, w_{ii'} & \text{if } O_{i'j} \notin V(M_{ij}) \end{cases}$$

Note in the above (and elsewhere) that $C_0$ is the
default value which is used to predict when no relevant
observations have been made. In all the filtering
methods we describe here and in all of our experiments,
we set $C_0 = 3$. Also in our experiments we set
$V(a) = \{x \in A : a - 1 \le x \le a + 1\}$.
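A compact sketch of G-Learn-Relation as described above may help in reading the prediction and update rules. The class and variable names are hypothetical. Ties in the argmax are broken toward the smallest score, which is an assumption of this sketch because the text does not specify tie-breaking.

```python
from collections import defaultdict

SCORES = [1, 2, 3, 4, 5]   # the range A of matrix entries
C0 = 3                     # default prediction when nothing is observed
UNOBSERVED = None          # plays the role of '*'

def V(a):
    """Values counted as correct when the true value is a."""
    return {x for x in SCORES if a - 1 <= x <= a + 1}

class GLearnRelation:
    def __init__(self, gamma):
        self.gamma = gamma
        # w[(i, i2)]: believed similarity of rows i and i2, initially 1
        self.w = defaultdict(lambda: 1.0)

    def predict(self, O, i, j):
        voters = [i2 for i2 in range(len(O))
                  if i2 != i and O[i2][j] is not UNOBSERVED]
        if not voters:
            return C0

        def support(a):
            # total weight of rows whose observed value lies in V(a)
            return sum(self.w[(i, i2)] for i2 in voters if O[i2][j] in V(a))

        # max() keeps the first maximiser, so ties go to the smallest
        # score (assumption of this sketch)
        return max(SCORES, key=support)

    def update(self, O, i, j, true_value):
        for i2 in range(len(O)):
            if i2 == i or O[i2][j] is UNOBSERVED:
                continue
            if O[i2][j] in V(true_value):
                self.w[(i, i2)] *= 2 - self.gamma   # voted correctly
            else:
                self.w[(i, i2)] *= self.gamma       # voted wrongly
```

In use, `predict` is called on the current observation matrix before the true entry is revealed, and `update` is called with the revealed value before it is written into the matrix.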

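We also consider an extension of G-Learn-Relation,
which we call Cross-G-Learn-Relation, which makes
use of observed values in the same row and different
columns, in addition to those in the same column
and different rows. (This is a generalization of the
two-dimensional weighted majority algorithm called
WMP2 proposed in [ALN95].)

Cross-G-Learn-Relation($0 < \gamma < 1$)

With each row pair $(i, i')$ is associated a weight, $w^{row}_{ii'}$,
and with each column pair $(j, j')$ is associated a weight,
$w^{col}_{jj'}$.

Initialization: $w^{row}_{ii'} = w^{col}_{jj'} := 1$

Prediction:

$$\hat{M}_{ij} := \begin{cases} \arg\max_{a \in A} \Big( \sum_{i' : O_{i'j} \in V(a)} w^{row}_{ii'} + \sum_{j' : O_{ij'} \in V(a)} w^{col}_{jj'} \Big) & \text{if } \{i' : O_{i'j} \neq *\} \cup \{j' : O_{ij'} \neq *\} \neq \emptyset \\ C_0 \text{ (a constant)} & \text{otherwise} \end{cases}$$

Weight update: Update $w^{row}_{ii'}$ as in G-Learn-Relation
and additionally, for all $j'$ ($\neq j$) such that $O_{ij'} \neq *$,

$$w^{col}_{jj'} := \begin{cases} (2-\gamma)\, w^{col}_{jj'} & \text{if } O_{ij'} \in V(M_{ij}) \\ \gamma\, w^{col}_{jj'} & \text{if } O_{ij'} \notin V(M_{ij}) \end{cases}$$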

Next, the version of G-Learn-Relation in which we
incorporate indirect evidence, referred to as
Learn-Relation-IE, enhances the weights used in
G-Learn-Relation by taking into account indirect evidence. If
we let $d_{ii'} = C_{ii'} - W_{ii'}$ with $C_{ii'}$ and $W_{ii'}$ as defined
above, then roughly speaking $d_{ii'} > 0$ is evidence for
row $i$ being similar to row $i'$, and $d_{ii'} < 0$ for the
converse. If, for some third row $i''$, we have both $d_{ii''} > 0$
and $d_{i'i''} > 0$, then this can be used as indirect
evidence for $i$ and $i'$ being similar. Conversely, if we have
$d_{ii''} \cdot d_{i'i''} < 0$, then this is indirect evidence for $i$ and $i'$
being dissimilar. Thus, we redefine the weights of
G-Learn-Relation by adding $\delta \cdot \min\{|d_{ii''}|, |d_{i'i''}|\}$ to $C_{ii'}$ if
$d_{ii''} > 0$ and $d_{i'i''} > 0$, and adding $\delta \cdot \min\{|d_{ii''}|, |d_{i'i''}|\}$
to $W_{ii'}$ if $d_{ii''} \cdot d_{i'i''} < 0$, where $\delta$ is a small constant
controlling the degree of contribution of indirect
evidence. The rest of the algorithm (prediction and direct
weight update) is the same as G-Learn-Relation.

Learn-Relation-IE($0 < \gamma < 1$, $0 < \delta$)

With each row pair $(i, i')$ are associated counters
$C_{ii'}$, $W_{ii'}$.

Initialization: $C_{ii'} = W_{ii'} := 0$

Prediction: Predict as in G-Learn-Relation, except the
weights $w_{ii'}$ are calculated as follows.

$$d_{ii'} = C_{ii'} - W_{ii'}$$

$$\hat{C}_{ii'} = C_{ii'} + \delta \sum_{i'' : d_{ii''} > 0,\ d_{i'i''} > 0} \min\{|d_{ii''}|, |d_{i'i''}|\}$$

$$\hat{W}_{ii'} = W_{ii'} + \delta \sum_{i'' : d_{ii''} \cdot d_{i'i''} < 0} \min\{|d_{ii''}|, |d_{i'i''}|\}$$

$$w_{ii'} = (2-\gamma)^{\hat{C}_{ii'}}\, \gamma^{\hat{W}_{ii'}}$$

Update: For all $i'$ ($\neq i$) such that $O_{i'j} \neq *$,

$C_{ii'} := C_{ii'} + 1$ if $O_{i'j} \in V(M_{ij})$

$W_{ii'} := W_{ii'} + 1$ if $O_{i'j} \notin V(M_{ij})$
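The indirect-evidence weight can be sketched as a single function. This is only an illustration of the rule described above: the function name, the dictionary representation of the counters, and the argument names are hypothetical choices, not the authors' implementation.

```python
def ie_weight(C, W, i, i2, rows, gamma, delta):
    """Weight w_{ii'} of Learn-Relation-IE for the row pair (i, i2).

    C and W map ordered row pairs to the counts of correct and wrong
    votes; missing pairs count as 0. `rows` lists all row indices.
    """
    def d(a, b):
        return C.get((a, b), 0) - W.get((a, b), 0)

    c_hat = C.get((i, i2), 0)
    w_hat = W.get((i, i2), 0)
    for k in rows:                      # k plays the role of the third row i''
        if k in (i, i2):
            continue
        d1, d2 = d(i, k), d(i2, k)
        bonus = delta * min(abs(d1), abs(d2))
        if d1 > 0 and d2 > 0:
            c_hat += bonus              # both similar to k: indirect support
        elif d1 * d2 < 0:
            w_hat += bonus              # one similar, one dissimilar to k
    return (2 - gamma) ** c_hat * gamma ** w_hat
```

Prediction then proceeds as in G-Learn-Relation with these weights, while the direct update only increments the counters.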


We compare the performance of these generalized
weighted majority algorithms against standard methods
based on correlation coefficients. Here we informally
describe these methods and refer the interested
reader to [SM95] for detailed definitions. Like
G-Learn-Relation, the correlation-based methods make
predictions by weighted voting by different rows for
which the entries in the same column have been observed,
except the weights between the rows are computed
using 'correlation coefficients.' The correlation
coefficient between any pair of rows is calculated using
the observed values in those rows in common columns.
Following [SM95], we also consider three variants of
the basic correlation-based method (also known as
Pearson r method): the thresholded method (Pearson
r L = θ), which lets only rows with a correlation
coefficient higher than threshold θ vote; the
constrained method (constrained Pearson r), which
fixes the average in calculating the correlation
coefficients at a constant rather than calculating it from
the data (in our experiments it was fixed at a constant
value); and the combination of the two, the thresholded
constrained method (constrained Pearson r L = θ).
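The correlation-based variants can be sketched as follows. This is a simplified illustration, not the definitions of [SM95]: the function names are hypothetical, the prediction is a plain weighted average of the observed scores, and rows with non-positive correlation are excluded when no threshold is given so that the average stays well defined (an assumption of this sketch). Passing `fixed_mean` gives the constrained variant; the constant to use is a parameter here.

```python
import math

def pearson(u, v, fixed_mean=None):
    """Correlation of two rows over the columns observed in both.

    With fixed_mean set, the mean is held at that constant
    (constrained variant) instead of being estimated from the data.
    """
    common = [(a, b) for a, b in zip(u, v)
              if a is not None and b is not None]
    if len(common) < 2:
        return 0.0
    if fixed_mean is None:
        mu = sum(a for a, _ in common) / len(common)
        mv = sum(b for _, b in common) / len(common)
    else:
        mu = mv = fixed_mean
    num = sum((a - mu) * (b - mv) for a, b in common)
    den = math.sqrt(sum((a - mu) ** 2 for a, _ in common)
                    * sum((b - mv) ** 2 for _, b in common))
    return num / den if den > 0 else 0.0

def predict_by_correlation(O, i, j, threshold=None, fixed_mean=None,
                           default=3.0):
    """Weighted average of the observed scores in column j."""
    cutoff = 0.0 if threshold is None else threshold
    num = den = 0.0
    for i2, row in enumerate(O):
        if i2 == i or row[j] is None:
            continue
        r = pearson(O[i], row, fixed_mean)
        if r <= cutoff:
            continue                  # this row does not vote
        num += r * row[j]
        den += r
    return num / den if den > 0 else default
```

Note that a coefficient estimated from two common columns counts as much as one estimated from fifty, which is the estimation-confidence issue that motivates the weighted majority alternative.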

4  Experiments 

4.1 The data

In our experiments, we made use of artificially
constructed (simulated) data, as well as real data obtained
through actual experiments on collaborative filtering
in a patent clipping service [AI97]. The simulation
data we used were for a target matrix of size 100 by 100
with 5 row types and 5 column types, with noise added.
We first generated a 5 by 5 matrix $M_b$, by randomly
assigning one of four groups {1,2}, {2,3}, {3,4}, {4,5}
to each of its entries. Then, based on $M_b$, we generate
a 100 by 100 matrix $M$ by randomly assigning one
of the five rows of $M_b$ to each row of $M$, and one of
the five columns of $M_b$ to each column of $M$. Finally,
we introduce noise by probabilistically assigning one
of five scores (1 through 5) to each group according to
the following probability table. For example, if row $i'$
of $M_b$ is assigned to row $i$ of $M$, column $j'$ of $M_b$ is
assigned to column $j$ of $M$, and the group assigned to the
$i',j'$-entry of $M_b$ is {2,3}, then the scores 1, 2, 3, 4 and
5 are assigned to the $i,j$-entry of $M$ with the probabilities
given in the {2,3} row of the table.


ent.ry  viiluo 

of  Mtj 

probability  of  assigning  each  score 

1  2  3  d  b 

{1-2} 

{2,3} 

{3.4} 

{^1.3} 

11  2  J  .t 

1  1  Cl  ('1  0  0  <1  n 

2  2  2  2  2  2  2  2 

a  i.  IL  J.  -1  iL  ill. 

2  2  2  2  2  2  2  2 

o_2.  H  J.  X  il 

2  2  2  2  2  2  2  2 
a“  ci'^  ri  Cl*  1  o  1 

V  ‘>  ')  ')  ')  ’>  9 

In our experiments, we set α = 0.1, which translates
to a noise rate of 0.075.

The  real  data  we  used  in  o\ir  experiments  are  scores 
given  by  various  ])eo])l('  on  pat('nts  according  to  their 
interests.  Scores  were'  given  by  77  |)eople  on  2o')8 
patents,  with  about,  bA  |)ei-  C('nt  of  the  entries  filled. 
Since  on  those  ent.i'ies  for  which  few  relat('d  ent  rii's  are 


known,  none  of  tin'  filtering  methods  consich'red  here' 
would  do  well,  w('  extractf'd  a  portion  of  this  data  by 
r('stricting  tin'  peoph'  to  thos('  whc)  scori'd  at  h'ast  •'iO 
patents,  and  the  p;itents  to  thos('  that  were'  rat,('d  by 
at  least  10  peo])l('.  'I  bis  resnlti'd  in  |■^'dncing  tin'  ma¬ 
trix  to  38  peoph'  by  277  patents,  and  2!)  per  (’('lit  of 
t  his  smaller  mat  rix  was  filh'd.  'I'll  ('  scori's,  which  ai'e 
integi'rs  bet  wc'i'ii  1  and  5.  ari'  dist  ributed  as  follows. 


1  2  3  d  T) 

Total 

i.'idi  im  r,()o  (too  42r> 

38 1  .'■) 

Note that the higher the score of a patent is, the more
interesting it is perceived to be by the user who scored it.

In our experiments involving simulated data, the
performance of each method was averaged over 4 randomized
runs on 4 randomly generated target matrices, 16
runs in total. For the experiments with real data, we
took the average over 4 randomized runs for each method.

4.2 Comparison with Correlation-based
methods

First, we compared the predictive performance of
G-Learn-Relation against those of the four variants of
correlation-based methods described in Section 3. For
the threshold value in a thresholded method, and for
the value of γ in G-Learn-Relation(γ), we tried all
multiples of 0.1 between 0 and 1. The results are shown
in Figure 1. Among the correlation-based methods,
it is verified that the thresholded constrained Pearson
r method did the best, as reported in [SM95]. It is
clear that G-Learn-Relation out-performed all of these
methods on real data, and it was essentially tied with
the best of all the correlation-based methods on
simulated data. On real data, this tendency is more
evident for column(patent)-based methods, but as the
column-based methods perform better than the
row-based methods, G-Learn-Relation is clearly the best
performing method overall.

We also evaluated these methods using measures that are more often used in practical applications, precision and recall. Figure 2 compares precision and recall (in the last 200 trials at each trial) for the two thresholded correlation-based methods and G-Learn-Relation. Figure 3 plots a combination of these two measures called ‘F-measure’ (more precisely $F_{\beta=1}$ in [Lewis94]), namely $\frac{2pr}{p+r}$, where $p$ stands for precision and $r$ for recall.

These graphs were obtained using real data using similarity between patents. We considered the entries that were given the score of 5 as desirable and predicted values of at least 3.5 to be selected.* We can see that G-Learn-Relation achieves the highest recall rate and the precision is comparable to others. Note that the thresholded constrained Pearson method which enjoys the highest precision suffers from having a very low recall rate.

*The precision is calculated as $N_c/N_s$, where $N_s$ is the number of selected entries and $N_c$ is the number of correctly selected entries of those. The recall is $N_c/N_d$, where $N_d$ is the number of desirable entries.

Figure 1: Correlation-based methods vs. G-Learn-Relation. Left graph: cumulative number of mistakes; Right graph: error rate in the last 200 trials. Top: simulated data; Middle: real data (similarity between people); Bottom: real data (similarity between patents).
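The precision, recall and F-measure used here can be computed as in the following minimal sketch. The function name and argument layout are illustrative assumptions; the selection threshold of 3.5 and the "desirable" score of 5 are the values stated in the text, and the F-measure is the standard $F_{\beta=1} = 2pr/(p+r)$.

```python
def precision_recall_f(actual, predicted, desirable_score=5, select_threshold=3.5):
    """Precision, recall and F-measure for one batch of predictions.

    actual:    true integer scores
    predicted: real-valued predictions, aligned with `actual`
    An entry is 'selected' if its prediction is at least select_threshold,
    and 'desirable' if its true score equals desirable_score.
    """
    selected = [i for i, v in enumerate(predicted) if v >= select_threshold]
    desirable = [i for i, v in enumerate(actual) if v == desirable_score]
    correct = [i for i in selected if actual[i] == desirable_score]
    precision = len(correct) / len(selected) if selected else 0.0
    recall = len(correct) / len(desirable) if desirable else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure
```

Returning 0.0 when nothing is selected or nothing is desirable is a convention chosen for this sketch, not something the text specifies.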

Another desirable aspect of G-Learn-Relation is its relative insensitivity towards the exact choice of its parameter ($\gamma$). Figure 4 shows how the predictive performance of G-Learn-Relation and the thresholded constrained correlation method vary, as we change their parameters. From the data we can see that the performance of the thresholded correlation-based method is extremely sensitive to small changes in the threshold value in a certain range. In contrast, small changes in $\gamma$ of G-Learn-Relation do not significantly affect its predictive performance. This phenomenon is especially noticeable on the simulated data, although the same tendency is observed in the real data. In a practical application, a wrong choice of threshold could be costly for correlation-based methods.

(similarity between patents). Left: G-Learn-Relation; Right: Thresholded constrained correlation method.

Figure 3: F-measure ($\frac{2pr}{p+r}$) in the last 200 trials.

4.3  Using  both  rows  and  columns 

We  evaluated  the  performance  improvement  brought 
about  by  the  ‘cross-methods,’  namely  methods  that 
make  use  of  both  row  similarity  and  column  sim¬ 
ilarity,  on  G-Learn-Relation  and  the  best  perform¬ 
ing  correlation-based  method  -  the  thresholded  con¬ 
strained  Pearson  r  method. 

The results of this experimentation are shown in Figure 5. On the simulated data, it is observed that G-Learn-Relation is improved more radically by the cross method, and as a result its cross version clearly out-performs that of the Thresholded Constrained Pearson r. On the real data, however, the performance of the cross method (for both G-Learn-Relation and Constrained
Pearson  r)  was  comparable  to  that  of  the  column-based 
method,  although  it  was  significantly  better  than  the 
row-based  method.  This  may  be  partly  attributable 
to  the  asymmetry  of  the  real  data  we  used:  there  were 
only  48  rows  whereas  there  were  277  columns.  In  prac¬ 
tical  applications  with  more  even  numbers  of  rows  and 
columns,  the  effect  may  be  more  visible. 

4.4  Using  indirect  evidence 

We compared the performance of Learn-Relation-IE and that of G-Learn-Relation, as well as their respective ‘cross’ versions. Of the two parameters $\gamma, \delta$ in Learn-Relation-IE($\gamma$, $\delta$), the same choice of $\gamma$ was used as G-Learn-Relation, and the best choice (out of a few) was used for $\delta$. The results are shown in Figure 6. On the simulated data, it is observed that the performance is improved significantly for both G-Learn-Relation and Cross-G-Learn-Relation, for a certain range of trial numbers: trials around the 1,000th to the 3,000th out of 10,000. On real data, unfortunately, no significant improvement was observed, except a little for the row(people)-based method. It may be that, with this particular data set, the range of trial numbers on which significant improvement is achieved is yet to come.

5  Theoretical  analysis 

In this section, we theoretically analyze the performance of G-Learn-Relation and that of correlation-based methods. In particular, we prove an upper bound on the worst case number of mistakes made by G-Learn-Relation. We also show a lower bound on the worst case number of mistakes made by the correlation-based method, which shows that, as a learning method, the correlation-based method does not necessarily converge, and can make a huge number of mistakes in the worst case.

5.1  Mistake  bound  for  G-Learn-Relation 

It can be shown that the upper bound obtained by Goldman and Warmuth for Learn-Relation can be generalized for G-Learn-Relation, when the target function is real-valued and $V(a)$ is defined as $V(a) = V_d(a) = \{x : a - d < x < a + d\}$.

We need a few definitions to state our result. Let $p$ be a partition over the set of rows $R$, $k_p$ the size of the partition, and $p = \{S^1, \ldots, S^{k_p}\}$. Then let $n_i = |S^i|$ and $S^i_j = \{M_{rj} : r \in S^i\}$. Let $u_i(a, S^i_j)$ be the number of $r \in S^i$ such that $M_{rj} \in V(a)$, and define

$\delta_{ij} = n_i - \max_a u_i(a, S^i_j)$.

Let the set of columns be $\{1, \ldots, m\}$. Let $\delta_i = \sum_{j=1}^{m} \delta_{ij}$ and define the noise $\alpha_p$ of partition $p$ as $\alpha_p = \sum_{i=1}^{k_p} \delta_i$.

Then, the following theorem holds.


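Theorem 1 For all $\gamma \in [0, 1)$, Algorithm G-Learn-Relation($\gamma$) makes at most

$$\min_p \left\{ k_p m + \min\left\{ \frac{n^2}{2e}\log e + \alpha_p\left(n - \frac{\alpha_p}{m}\right)\log\frac{1}{\beta},\ \sqrt{\frac{3mn^2\log k_p + 2\alpha_p(mn - \alpha_p)\log\frac{1}{\beta}}{\log\frac{2}{1+\beta}}} \right\} \right\}$$

mistakes. Here, $\beta = \frac{\gamma}{2-\gamma}$, and the first minimization is with respect to all possible partitions $p$ satisfying the following condition:

$\delta_i \le n_i m/2$ for all $i = 1, \ldots, k_p$.  (1)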

The proof is similar to the proof of the analogous theorem for Learn-Relation in [GW95], and omitted due to lack of space. The condition (1) states that more than half the elements in each partition assume a representative value. This is reasonable, and note in particular, that it always holds when the target function is binary. Now, by plugging in $\gamma = 0$, $\alpha_p = 0$ in the above theorem, we obtain the following as corollary.


Corollary 1 For any noiseless partition $p$ (with $\alpha_p = 0$), Algorithm G-Learn-Relation(0) makes at most

$$k_p m + \min\left\{ \frac{n^2}{2e}\log e,\ n\sqrt{3m\log k_p} \right\}$$

mistakes.


5.2 A mistake lower bound for the correlation-based methods

We show that, in the absence of noise, the correlation-based method can make a lot more mistakes in the worst case than G-Learn-Relation(0) with $V(a) = \{a\}$.

For this analysis, we assume that a prediction is correct only when the round-off value equals the correct value.

Theorem 2 In the worst case, any of the four correlation-based methods can make as many as $nm/C$ mistakes, where $n$ is the number of rows and $m$ is the number of columns, and $C$ is a positive constant.

Proof

We first prove the statement for the basic correlation-based method with no threshold and no constraint. Suppose that the target matrix consists of many repetitions (in both row and column directions) of the following block consisting of two types of rows and four types of columns, where $i$ ranges over 0 to $m/8 - 1$,

Thus if the target matrix were $n$ by $m$, it would consist of $n/2$ rows of type 1 and $n/2$ rows of type 2, and each type of row would consist of $m/8$ repetitions of 8 columns. Suppose that the trials proceed from left to right and top down by blocks. Within each block, the trials proceed by column, except the first two columns (the $(8i+1)$-st and $(8i+2)$-nd columns) are predicted in the row-first ordering. That is, after predicting the $(8i+1)$-st column in the type 1 row, the $(8i+2)$-nd column in the same row is predicted before the two columns in the other (type 2) row are predicted. Note, with this ordering, that when predicting an $(8i+1)$-st column of any block, the average of the past values for any row is 3. It can be shown that, when the correlation-based method is predicting an entry in the $(8i+1)$-st column, the predicted value will not exceed $3 + 2/3$ for a row of type 1, and it will be at least $3 - 2/3$ for a type 2 row. This is because the number of known entries in a row of the same type does not exceed the number of known entries in a different type of row. Thus, the correlation-based method makes a mistake on every entry in the $(8i+1)$-st column. Hence, if $n$ is the number of rows and $m$ is the number of columns, it will make at least $nm/8$ mistakes.

Note that the above argument also applies to the constrained correlation-based method, since the average
is fixed at 3. A thresholded correlation-based method
can  beat  the  above  example  by  setting  the  threshold  to 
be  higher  than  1/2,  but  for  any  fixed  value  of  thresh¬ 
old,  an  analogous  example  can  be  constructed  by  mak¬ 
ing  the  block  longer,  repeating  the  last  two  columns  an 
appropriate  number  of  times.  This  would  yield  a  sim¬ 
ilar  bound,  except  a  different  constant  replaces  8.  □ 

In contrast, we know from Corollary 1 that G-Learn-Relation(0) makes at most $2m + n\sqrt{3m}$ mistakes. Note that as $m$ and $n$ become large, the final error rate of G-Learn-Relation(0) will approach zero (the learning converges), but the error rate of the correlation-based method (in this worst case) will not be lower than $1/C$.
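The contrast between the two bounds can be checked numerically with a small sketch. The function names are hypothetical; the upper bound uses only the $k_p m + n\sqrt{3m\log k_p}$ term with a base-2 logarithm (an assumption, chosen because it reproduces the $2m + n\sqrt{3m}$ figure for $k_p = 2$), and the lower bound uses $C = 8$ from the construction in the proof.

```python
import math


def g_learn_relation_bound(n, m, k_p=2):
    """Upper bound k_p*m + n*sqrt(3*m*log2(k_p)) on the number of mistakes.

    Assumes a noiseless partition of size k_p and a base-2 logarithm.
    For small n and m the bound can exceed n*m and is then vacuous.
    """
    return k_p * m + n * math.sqrt(3 * m * math.log2(k_p))


def correlation_lower_bound(n, m, c=8):
    """Worst-case number of mistakes n*m/c for the correlation-based method."""
    return n * m / c


for size in (1_000, 100_000):
    entries = size * size
    upper_rate = g_learn_relation_bound(size, size) / entries
    lower_rate = correlation_lower_bound(size, size) / entries
    print(f"n = m = {size}: G-Learn-Relation(0) rate <= {upper_rate:.4f}, "
          f"correlation-based worst-case rate >= {lower_rate:.4f}")
```

The upper-bound rate shrinks as the matrix grows, while the worst-case rate of the correlation-based method stays at $1/8$.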

6  Concluding  remarks 

We have applied weighted majority type prediction algorithms on the problem of collaborative filtering, and empirically demonstrated that they perform better than the correlation-based filtering methods. In so doing, we proposed a generalization G-Learn-Relation of the weighted majority prediction algorithm of Goldman and Warmuth [GW95] to the multi-valued setting, and gave a theoretical guarantee on the performance of this algorithm. Additionally, we proposed a version of G-Learn-Relation that makes use of indirect evidence, as well as a version in which both row similarity and column similarity are used for prediction. In both cases, significant performance improvement was observed in experiments involving simulated data. It is left as future research to verify the same on real data, which we believe will require larger-scale experiments.

Abstract

We consider feature selection in the “wrapper” model of feature selection. This typically involves an NP-hard optimization problem that is approximated by heuristic search for a “good” feature subset. First considering the idealization where this optimization is performed exactly, we give a rigorous bound for generalization error under feature selection. The search heuristics typically used are then immediately seen as trying to achieve the error given in our bounds, and succeeding to the extent that they succeed in solving the optimization. The bound suggests that, in the presence of many “irrelevant” features, the main source of error in wrapper model feature selection is from “overfitting” hold-out or cross-validation data. This motivates a new algorithm that, again under the idealization of performing search exactly, has sample complexity (and error) that grows logarithmically in the number of “irrelevant” features - which means it can tolerate having a number of “irrelevant” features exponential in the number of training examples - and search heuristics are again seen to be directly trying to reach this bound. Experimental results on a problem using simulated data show the new algorithm having much higher tolerance to irrelevant features than the standard wrapper model. Lastly, we also discuss ramifications that sample complexity logarithmic in the number of irrelevant features might have for feature design in actual applications of learning.

1 Introduction

In recent years, feature selection for classification
and  regression  has  been  enjoying  increasing  interest 
in  the  Machine  Learning  community.  Impressive  per¬ 
formance  gains  have  been  reported  by  numerous  au¬ 
thors,  and  numerous  feature  subset  search  heuristics 
have been proposed. (The literature is too wide to survey
here, but see [Langley, 1994] and [Miller, 1990] for
overviews.)  In  view  of  these  significant  empirical  suc¬ 
cesses,  one  central  question  is:  What  theoretical  jus¬ 
tification  is  there  for  feature  selection?  For  example, 
in  parametric  function  approximation  schemes  such  as 
linear  regression,  it  is  often  the  case  that  excluding  a 
feature  is  mathematically  identical  to  setting  the  co- 
efficient(s)  associated  with  that  feature  to  0.  As  fea¬ 
ture  selection  typically  runs  a  risk  of  misidentifying  the 
“irrelevant” features, why then is it apparently often
superior  to  try  to  estimate  which  features  are  “irrele¬ 
vant”  and  set  their  coefficients  to  0,  rather  than  leave 
them  and  use  the  estimated  coefficients  for  these  fea¬ 
tures  (which  will  typically  be  near  0  anyway)?  The 
theoretical  results  in  this  paper  will  address  this  ques¬ 
tion. 

Since  feature  selection  attempts  to  eliminate  “irrele¬ 
vant”  features,  another  central  question  is:  How  does 
the  performance  of  feature  selection  scale  with  the 
number  of  irrelevant  features?  The  Winnow  algorithm 
of  Littlestone  for  learning  Boolean  monomials,  or  more 
generally  also  k-DNF  formulae  and  r-of-k  threshold 
functions  (over  boolean  inputs),  from  noiseless  data 
enjoys  worst-case  loss  logarithmic  in  the  number  of 
irrelevant  features  [Littlestone,  1988].  Likewise,  the 
EG  algorithm  for  linear  regression  with  quadratic  error 
also  has  such  loss  (and  indeed  sample  complexity)  that 
grows  logarithmically  in  the  number  of  irrelevant  fea¬ 
tures  [Kivinen  and  Warmuth,  1994].  For  learning  from 
noiseless data, of a representation of a boolean concept
(over boolean inputs), Almuallim and Dietterich have
also  shown  that  an  algorithm  that  finds  the  smallest 
set  of  features  consistent  with  the  training  data  (such 
as  by  exhaustive  enumeration)  also  enjoys  loss  loga¬ 
rithmic  in  the  number  of  irrelevant  features  [Almual¬ 
lim  and  Dietterich,  1994].  If  it  were  true  in  general 
that  feature  selection  makes  sample  complexity  loga¬ 
rithmic  in  the  number  of  irrelevant  features  (though 
possibly  depending  more  heavily  on  the  number  of  rel¬ 
evant  features),  then  this  would  imply,  for  example, 
that  squaring  the  number  of  features  we  have  means 
needing  only  twice  as  much  training  data.  This  could 
have huge ramifications on the way features are designed for real-world applications. In this paper, we will show that, modulo computational and approximation issues, this ideal of sample complexity logarithmic in the number of irrelevant features (which of course means being able to handle a number of irrelevant features exponential in the number of training examples) can indeed be achieved with a new feature selection algorithm we propose.
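The claim that squaring the number of features only doubles the data requirement follows from one line of arithmetic. The sketch below assumes, purely for illustration, that the required sample size has the idealized form $m(f) = C \log f$ for a constant $C$ that does not depend on $f$; the symbols $m(f)$ and $C$ are introduced here and are not the paper's notation.

```latex
m(f) = C \log f
\quad\Longrightarrow\quad
m(f^2) = C \log\!\left(f^2\right) = 2\,C \log f = 2\, m(f).
```

Read in the other direction, the same relation says that a sample of size $m$ can accommodate on the order of $e^{m/C}$ features, which is the "exponentially many irrelevant features" reading given in the text.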

Next,  the  notion  of  “relevance”  is  closely  related  to  fea¬ 
ture  selection.  Intuitively,  one  goal  of  feature  selection 
is  to  eliminate  all  but  a  small  set  of  “relevant”  fea¬ 
tures,  which  are  then  given  to  an  induction  algorithm. 
However,  there  have  been  difficulties  with  a  number 
of  definitions  of  “relevance”  [Kohavi  and  John,  1997], 
and  we  take  the  alternative  view,  which  is  quite  simi¬ 
lar  in  flavor  to  those  in  [Littlestone,  1988]  and  [Kivinen 
and  Warmuth,  1994],  of  the  goal  of  feature  selection 
as  this:  If  there  exists  a  hypothesis  that,  using  only  a 
“small”  number  of  features,  gives  good  generalization 
error,  then  we  want  our  classifier  to  achieve  close  to 
this  level  of  performance  with  high  probability.  This 
will  be  made  rigorous  in  subsequent  sections,  but  note 
in  particular  that  we  make  no  claims  towards  exclud¬ 
ing  “irrelevant”  features  or  including  all  the  “relevant” 
features,  so  long  as  the  particular  set  of  selected  fea¬ 
tures  allows  us  to  have  performance  close  to  that  of 
using the “optimal” set of features.¹ In the remainder
of this paper, we will use the terms “relevant” and
“irrelevant”  only  when  we  expect  them  to  be  consis¬ 
tent  with  any  reasonable  definition  of  relevance. 

Using the terminology introduced by [John et al., 1994], feature selection algorithms broadly fall into the “filter” and the “wrapper” models. The filter model relies on general characteristics of the training data to select some feature subset, doing so without reference to the learning algorithm. In the wrapper model, one generates sets of candidate features, runs them through the learning algorithm, and uses the performance of the resulting hypothesis to evaluate the feature set. While the wrapper model tends to be more computationally expensive, it also unsurprisingly tends to find feature sets better suited to the inductive biases of our learning algorithm, and tends to give superior performance [Langley, 1994]. In this paper, we study only the wrapper model of feature selection, and largely in the context of classification.

¹Aside from good generalization error, other goals of feature selection might be user-interpretability and parsimony of hypotheses for fast prediction. We will not address these goals in this paper.

Our  analysis  is  largely  inspired  by  [Kearns,  1996],  with 
our  theoretical  results  heavily  based  on  the  techniques 
given  there  and  those  outlined  in  [Kearns  et  al.,  1997]. 
We  also  rely  heavily  on  tools  from  [Vapnik,  1982],  that 
give  a  very  general  framework  for  bounding  the  devi¬ 
ation  of  training  error  from  generalization  error. 

2 Preliminaries

2.1  Feature  Selection 

Let X be the fixed f-dimensional input space, where f is the number of features in the inputs we are provided. For simplicity, we also assume a fixed binary concept $c : X \mapsto \{0, 1\}$. We are provided m training examples $S = \{(x^i, y^i)\}_{i=1}^{m}$, with each of the f-dimensional input vectors $x^i = [x^i_1\ x^i_2\ \ldots\ x^i_f]^T$ drawn i.i.d. from some fixed distribution $D_X$ over X, and corresponding labels $y^i = c(x^i) \in \{0, 1\}$. In this development, we will also briefly consider the case where the labels are independently corrupted by noise with a noise rate $\eta \in [0, 0.5)$, so that $y^i = c(x^i)$ with probability $1 - \eta$, and $y^i = 1 - c(x^i)$ with probability $\eta$. Note that c may use all f features, but we hope that it can be approximated well (in the generalization-error sense, to be defined shortly) by a function that depends only on a small subset of the f features.
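The data model just described can be simulated with a short sketch. The names `draw_sample`, `concept` and `draw_input` are hypothetical and the use of Python's `random` module is an implementation choice; only the model itself (i.i.d. inputs, labels $c(x)$ flipped independently with probability $\eta$) comes from the text.

```python
import random


def draw_sample(concept, draw_input, m, eta=0.0, rng=None):
    """Draw m labelled examples (x, y) under the classification-noise model.

    concept:    function mapping an input vector to a label in {0, 1}
    draw_input: function taking an RNG and returning one input vector
    eta:        noise rate in [0, 0.5); each label is flipped independently
                with probability eta
    """
    if not 0.0 <= eta < 0.5:
        raise ValueError("noise rate eta must lie in [0, 0.5)")
    rng = rng or random.Random()
    sample = []
    for _ in range(m):
        x = draw_input(rng)
        y = concept(x)
        if rng.random() < eta:
            y = 1 - y  # corrupt the label
        sample.append((x, y))
    return sample
```

For example, `draw_sample(lambda x: x[0], lambda r: tuple(r.randint(0, 1) for _ in range(5)), 100, eta=0.1)` draws 100 examples over five boolean features where only the first feature matters.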

We will use uppercase F to denote sets of features, and use $F_i$ to identify the i-th feature. For example, the feature set including the 1st, 4th and 10th features may be written $F = \{F_1, F_4, F_{10}\}$. For any input vector x, let $x|_F$ be x with all the features not in F eliminated; sometimes, we will call this “x restricted to F.” Analogously, let $X|_F$ denote the input space X with all the dimensions/features not in F eliminated, and $S|_F$ be the data set S with each $x^i$ replaced by $x^i|_F$. In a slight abuse of notation, if we have a hypothesis $h : X|_F \mapsto \{0, 1\}$ defined only on the subspace of features $X|_F$, we extend it to X in the natural way (with h ignoring features not in F). Thus, for any hypothesis h, we can write the generalization error (with respect to uncorrupted data) as $\varepsilon(h) = \Pr_{x \sim D_X}[h(x) \ne c(x)]$ (where the dependence of $\varepsilon(h)$ on $D_X$ has been suppressed for notational brevity), and the empirical error on a set of data S as $\hat{\varepsilon}_S(h) = \frac{1}{|S|}|\{(x, y) \in S \mid h(x) \ne y\}|$.
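The restriction notation and the empirical error can be written down directly. In this minimal sketch the function names are hypothetical and features are identified by 0-based indices (the text numbers them from 1).

```python
def restrict(x, features):
    """x restricted to F: keep only the listed (0-based) feature indices."""
    return tuple(x[i] for i in sorted(features))


def restrict_sample(sample, features):
    """S restricted to F: every input replaced by its restriction."""
    return [(restrict(x, features), y) for x, y in sample]


def extend(h_restricted, features):
    """Extend a hypothesis on X restricted to F to all of X.

    The extended hypothesis simply ignores features outside F.
    """
    return lambda x: h_restricted(restrict(x, features))


def empirical_error(h, sample):
    """Fraction of examples (x, y) in the sample with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)
```

With these helpers, the empirical error of a restricted hypothesis on full-dimensional data is `empirical_error(extend(h, F), S)`.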

2.2  The  wrapper  model 

In the wrapper model of feature selection suggested by [John et al., 1994], we are given a learning algorithm L that, for any set of features F, takes a training set $S|_F$, and outputs a hypothesis $h : X|_F \mapsto \{0, 1\}$. Given a training set S, an application of feature selection under this model might randomly split S into a training set S' of size $(1 - \gamma)m$ and a hold-out set S'' of size $\gamma m$, and perform a search for a set of features F so that when the learning algorithm is applied to S' restricted to F, the resulting hypothesis $h = L(S'|_F)$ has low empirical error $\hat{\varepsilon}_{S''}(h)$ on the hold-out data S''. Here, $\gamma \in [0, 1]$, the fraction of S assigned to the hold-out set, is called the hold-out
fraction.  A  more  sophisticated  application  of  feature 
selection  may  use  n-fold  or  leave-one-out  cross  valida¬ 
tion  rather  than  hold-out.  But  as  they  asymptotically 
yield  at  best  small-constant  improvements  over  using 
hold-out  and  as  leave-one-out  is  at  worst  little  better 
than  training  error  in  estimating  generalization  error, 
while  rendering  the  algorithm’s  performance  much  less 
tractable  to  analysis  [Kearns  and  Ron,  1997],  we  will 
not  explicitly  consider  them  here,  though  we  believe 
our  results  will  be  suggestive  of  the  performance  of 
these  schemes  as  well. 

For  any  given  learning  algorithm  L,  the  optimal  way 
to  perform  feature  selection  is  intimately  related  to 
the  inductive  biases  of  L.  For  example,  if  L  is  “suffi¬ 
ciently  clever”  about  doing  its  own  feature  selection, 
then  one  would  simply  give  it  S  unrestricted  to  any  fea¬ 
ture  subset,  and  allow  it  to  select  its  own  features.  For 
this  analysis,  therefore,  we  make  the  (rather  strong) 
assumption that given a particular data set $S|_F$, L
chooses  the  hypothesis  h  from  some  class  of  hypotheses 
(shortly  to  be  formalized)  so  as  to  minimize  training 
error.  This  closely  ties  in  with  the  learning  framework 
studied  by  [Vapnik,  1982],  and  is  also  used  in  [Kearns, 
1996]  and  [Kearns  et  al.,  1997]  in  proving  bounds  on 
generalization  error.  We  believe  it  to  be  a  very  natural 
model, and that it is a rich enough class of learning algorithms to merit detailed study. (But also see [Kearns et al., 1997] for comments regarding relations to learning algorithms that do not exactly do this; for example, it is not difficult to derive rigorous generalizations of all of our results if L manages to only approximately minimize training error.)

More formally, for any feature set F, we assume that we have a hypothesis class $H_F$ of hypotheses each with domain $X|_F$. But, with many induction algorithms, each feature is treated in a “similar” manner - for example, when $X = \mathbb{R}^f$, then for two feature sets F and F' of the same size, it makes intuitive sense to identify $X|_F$ and $X|_{F'}$ and therefore $H_F$ and $H_{F'}$, as they are both sets of functions mapping from $\mathbb{R}^{|F|}$ to {0, 1}. For simplicity, let us further make the assumption that the hypothesis class $H_F$ depends on F only through |F|, and let $H_r$ be our set of functions with domain X restricted to any set of r features. (This assumption is not really necessary, but it greatly eases our notational burden, and leaving out the assumption does not gain much in terms of theoretical results.) It will always be clear from context which particular set F of features $h \in H_{|F|}$ takes as input. Note also that we have assumed that there is some “uniform” way of handling all features, whether they are discrete/continuous, have different ranges, etc. For simplicity, one may wish to think of the particular case where all features are real numbers for the remainder of this paper. In this notation then, our previous assumption of error minimization is that when L is given $S|_F$, it outputs the hypothesis $h \in H_F$ (where $H_F$ is identified with $H_{|F|}$) that minimizes training error on $S|_F$. For the remainder of this paper, we will implicitly assume L meets these two assumptions - that it treats features “uniformly,” and that it minimizes training error over $H_{|F|}$.

One more definition we need is to let $r_{VC}$ be the Vapnik-Chervonenkis dimension [Vapnik and Chervonenkis, 1971, Vapnik, 1982] of the hypothesis class $H_r$. Normally, we expect $0_{VC} \le 1_{VC} \le 2_{VC} \le \cdots$, though this is not an assumption we use. For example, if $H_r$ is the class of linear discriminant functions over $\mathbb{R}^r$, then $r_{VC} = r + 1$. We chose this notation so that, to specialize our ensuing bounds on generalization error to linear discriminant functions, which we later use in our experiments, $r_{VC}$ may everywhere be replaced with r (or at least when r > 0).

Finally, to obtain the performance bounds, we wish to make statements of the form that “we will, with high probability, find a hypothesis with generalization error no worse than $\epsilon$ more than the best hypothesis that uses r features.” To formalize this, define the approximation rate function $\varepsilon_g(r)$ to be the least generalization error achievable by any hypothesis $h \in H_r$ using any set of r features. In general, we expect


On  Feature  Selection  407 


$\varepsilon_g(1) \ge \varepsilon_g(2) \ge \cdots$, though this is also not an assumption we require (except briefly when we summarize our results in terms of sample complexity).

Thus, in the common instantiation of wrapper model feature selection, we search for a feature set F such that when L is applied to $S'|_F$, the resulting hypothesis has low empirical error on the hold-out set. (That is, $\hat{\varepsilon}_{S''}(L(S'|_F))$ is minimized.) Leaving aside details of the actual search, we will call this idealization the STANDARD-WRAP algorithm. Note that in performing the search, enumeration over all the $2^f$ possible feature sets is usually intractable, and there is no known algorithm for otherwise performing this optimization tractably. Indeed, the Feature Selection problem in general is NP-hard [Garey and Johnson, 1979], but much work over recent years has developed a large number of heuristics for performing this search efficiently. (Again, the literature is too wide to survey here, but examples include [Moore and Lee, 1994, Caruana and Freitag, 1994, Yang and Honavar, 1997], and [Langley, 1994, Miller, 1990] include overviews.)
In  this  development,  we  will,  in  the  style  of  [Kearns, 
1996],  give  bounds  for  generalization  error  when  this 
optimization  is  performed  exactly.  Of  course,  the  ex¬ 
tent  to  which  our  bounds  predict  actual  performance 
will  in  part  depend  on  the  extent  to  which  the  opti¬ 
mization  algorithms  succeed  in  performing  this  search 
on  “real  life”  distributions  of  data.  Alternatively, 
one  can  also  view  these  bounds  as  what  the  heuris¬ 
tic  search/approximation  algorithms  are  (in  a  rigorous 
sense,  to  be  discussed  later)  aspiring  to  do,  with  the 
bounds  giving  insight  into  how  we  might  expect  the 
algorithms  to  perform. 
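The idealized search can be made concrete for very small f by brute-force enumeration. The sketch below is illustrative only: the function names are hypothetical, the learner is a simple lookup table over boolean features (one possible training-error minimizer whose hypothesis class depends on F only through |F|), and the exhaustive loop over all $2^f$ subsets is exactly the step the text notes is usually intractable.

```python
import itertools
import random
from collections import Counter, defaultdict


def restrict(x, features):
    """Return x restricted to the given (0-based) feature indices."""
    return tuple(x[i] for i in features)


def erm_table_learner(sample, features):
    """Hypothetical learner L: majority label per restricted input.

    This minimizes training error over all {0,1}-valued functions
    of the chosen features.
    """
    votes = defaultdict(Counter)
    for x, y in sample:
        votes[restrict(x, features)][y] += 1
    default = Counter(y for _, y in sample).most_common(1)[0][0]
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    return lambda x: table.get(restrict(x, features), default)


def empirical_error(h, sample):
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def standard_wrap(sample, num_features, gamma, learner=erm_table_learner, rng=None):
    """Exhaustive wrapper search: returns (hold-out error, feature set, hypothesis)."""
    rng = rng or random.Random(0)
    shuffled = list(sample)
    rng.shuffle(shuffled)
    n_hold = int(round(gamma * len(shuffled)))
    hold_out, train = shuffled[:n_hold], shuffled[n_hold:]
    if not hold_out or not train:
        raise ValueError("gamma must leave both sets non-empty")
    best = None
    for r in range(num_features + 1):
        for features in itertools.combinations(range(num_features), r):
            h = learner(train, features)         # train on S' restricted to F
            err = empirical_error(h, hold_out)   # evaluate on S''
            if best is None or err < best[0]:
                best = (err, features, h)
    return best
```

Ties are broken in favor of the first (smallest) feature set encountered, which is a choice of this sketch rather than part of the definition.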

3  Main  Results 

The ensuing bounds are all given to hold “with high probability.” We defer their more detailed versions to the full paper, but note that when we say “with high probability,” we mean that the bound holds with at least probability $1 - \delta$ for any $\delta > 0$, with constants that depend on $\delta$ (through an omitted $\log\frac{1}{\delta}$ term) hidden by the $O(\cdot)$ notation.

Bound for performance without feature selection

The  Universal  Estimate  Rate  bound  of  Vapnik  and 
Chervonenkis  [Vapnik  and  Chervonenkis,  1971,  Vap¬ 
nik,  1982]  gives  a  bound  on  generalization  error  when 
learning using all $f$ features without feature selection.


Theorem  1  (Vapnik  and  Chervonenkis,  1971) 

With  high  probability,  the  generalization  error  of  the 
hypothesis  h  =  L(S),  given  by  L  applied  to  S  (unre¬ 
stricted  to  any  feature  subset),  is  bounded  by: 


$$\varepsilon(h) \le \varepsilon_g(f) + O\left(\sqrt{\frac{f_{vc}}{m}\left(\log\frac{m}{f_{vc}} + 1\right)}\right) \qquad (1)$$

Note this is a bound for learning from noiseless data;
when the training data labels have independently been
corrupted at some noise rate $\eta$, the second term in the
bound becomes $O\left(\sqrt{\frac{f_{vc}}{(1-2\eta)^2 m}\left(\log\frac{m}{f_{vc}} + 1\right)}\right)$.

Bound  for  performance  of  wrapper  model 

Applying  the  proof  technique  given  in  [Kearns,  1996] 
(used  to  bound  the  error  of  hold-out)  to  feature  selec¬ 
tion,  we  obtain  the  following  theorem: 

Theorem 2 Given $L$, $S$, $\gamma$, the hypothesis $h$ output by
STANDARD-WRAP, given by $h = L(S'|_{\hat{F}})$ where $\hat{F} =
\arg\min_F \hat{\varepsilon}_{S''}(L(S'|_F))$, will, with high probability, have
generalization error bounded by


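$$\varepsilon(h) \le \min_{0 \le r \le f}\left\{\varepsilon_g(r) + O\left(\sqrt{\frac{r_{vc}}{(1-\gamma)m}\left(\log\frac{(1-\gamma)m}{r_{vc}} + 1\right)}\right)\right\} + O\left(\sqrt{\frac{f}{\gamma m}}\right)$$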


Proof (Sketch): The first square-root term is simply
the universal estimation rate bound as before,
which says that with high probability, the hypothesis
obtained by applying $L$ to $S'|_F$ for any fixed
$F$ with $|F| = r$ will give additional error no more
than $O\left(\sqrt{\frac{r_{vc}}{(1-\gamma)m}\left(\log\frac{(1-\gamma)m}{r_{vc}} + 1\right)}\right)$. Following this, using
a hold-out test set of size $\gamma m$ to test $2^f$ hypotheses
will, by a standard Chernoff-bound argument, result
with high probability in picking a hypothesis with generalization
error no more than $O(\sqrt{\log(2^f)/\gamma m}) =
O(\sqrt{f/\gamma m})$ higher. □
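
The "standard Chernoff-bound argument" used in the proof sketch can be spelled out in a few lines. The following is only a sketch, using Hoeffding's inequality and a union bound; the symbols $N$ (number of hypotheses tested on the hold-out set) and $t$ (deviation threshold) are introduced here for illustration and are not notation from the paper.

```latex
% One fixed hypothesis h, tested on \gamma m independent hold-out examples:
\Pr\left[\,\left|\hat{\varepsilon}_{S''}(h) - \varepsilon(h)\right| > t\,\right]
  \le 2\exp\left(-2\gamma m\, t^2\right).

% Union bound over N hypotheses; set the right-hand side equal to \delta:
2N\exp\left(-2\gamma m\, t^2\right) = \delta
  \quad\Longrightarrow\quad
  t = \sqrt{\frac{\log(2N/\delta)}{2\gamma m}}
    = O\left(\sqrt{\frac{\log N}{\gamma m}}\right).

% With all N hold-out estimates within t of the truth, the hypothesis with the
% smallest hold-out error is within 2t of the best of the N in true error.
% For N = 2^f this gives the O(\sqrt{f/\gamma m}) term.
```

The argument relies on the hypotheses being fixed before the hold-out set is examined, which holds here because they are trained on $S'$ only.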

Again,  this  bound  holds  only  when  learning  from 
noiseless  data.  Similar  to  Theorem  1,  a  generalization 
to learning from noisy data can be obtained by replacing
all occurrences of $m$ in any denominator term in
the bound by $(1 - 2\eta)^2 m$, where $\eta$ is the noise rate.

One important remark here is that the $O(\sqrt{f/\gamma m})$
term is a worst-case bound for evaluating $2^f$ hypotheses
on the independent hold-out set $S''$ of size $\gamma m$.
Its increase with $f$ reflects the fact that we are testing
a set of hypotheses of size exponential in $f$, and
that there is potential for “overfitting” the $\gamma m$ hold-out
samples. (In the context of feature selection, the
issue of overfitting of hold-out data was also raised
by [Kohavi and Sommerfield, 1995]; see also [Ng, 1997]
for a detailed discussion of overfitting of hold-out data
in hypothesis selection.) But since this is a worst-case
bound, it holds in particular for the “bad case”
where all $2^f$ hypotheses are “very different” from each
other. This is unlikely as they were trained on the
same dataset $S'$ and using only $f$ distinct features.
For at least some pathological hypothesis classes (that
may, for example, include a set of hash-like basis functions
so that changing one feature’s range dramatically
changes the output hypotheses), this is certainly possible;
but for more “sensible” hypothesis classes, we
might expect it to be possible to significantly tighten
this bound. We have not managed to formalize this
yet, but conjecture, based on the behavior of power-law
decay learning curves, that the asymptotic behavior
for “many” learning algorithms will be better
modeled by replacing this last term in the bound by
$O(\sqrt{f^{\alpha}/\gamma m})$ for some $\alpha \in (0, 1]$. (A preliminary analysis
suggests that under a (perhaps surprisingly large)
range of formal modeling assumptions regarding how
much hypotheses change when $F$ is changed, the number
of “significantly different” hypotheses does grow
exponentially in $f$, which would suggest $\alpha = 1$ behavior. On the
other hand, there are certainly also some reasonable
assumptions that would lead to $\alpha < 1$; and we defer a
detailed discussion of this to the full paper.)

Bound  for  performance  of  new  algorithm 

For STANDARD-WRAP, the dependence on $f$ of our
bound on the error is $\sqrt{f/\gamma m}$ (or possibly $\sqrt{f^{\alpha}/\gamma m}$),
and it comes from testing $2^f$ hypotheses on hold-out
data. If $f \gg r_{vc}$, where $r$ is the number of features
needed to approximate the target concept well, this
will be the dominant term. Consider instead
the following algorithm, which we call ORDERED-FS:

1.  For each $0 \le r \le f$, find the hypothesis $h_r$ that, of
all the hypotheses using exactly $r$ features, minimizes
error on the training set $S'$. (This involves
a search over all sets of $r$ features.)

2.  Evaluate all $f + 1$ hypotheses $h_0, \ldots, h_f$ on the hold-out
set $S''$, and pick the one with the smallest
hold-out error.

Note that we are now testing only $O(f)$ hypotheses on
the hold-out data, so the previous $\sqrt{f/\gamma m}$ term now
becomes $\sqrt{(\log f)/\gamma m}$.
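
The two steps of ORDERED-FS can be sketched in code as follows. This is a minimal illustration, not the implementation used in the paper: it searches feature subsets exhaustively (so it is only usable for very small $f$), and it uses a least-squares linear discriminant as a stand-in for the learning algorithm $L$. The function names `fit_linear`, `error_rate` and `ordered_fs` are hypothetical.

```python
from itertools import combinations

import numpy as np


def fit_linear(X, y):
    """Stand-in learner L (an assumption): least-squares linear discriminant with bias."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * y - 1.0, rcond=None)
    return w


def error_rate(w, X, y):
    """Fraction of examples misclassified by the linear discriminant w."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    pred = (A @ w > 0).astype(int)
    return float(np.mean(pred != y))


def ordered_fs(X_train, y_train, X_hold, y_hold):
    """Return (hold-out error, feature indices) of the selected hypothesis."""
    f = X_train.shape[1]
    candidates = []
    # Step 1: for each r, keep the size-r subset with the lowest TRAINING error.
    for r in range(f + 1):
        best = None
        for subset in combinations(range(f), r):
            cols = list(subset)
            w = fit_linear(X_train[:, cols], y_train)
            e = error_rate(w, X_train[:, cols], y_train)
            if best is None or e < best[0]:
                best = (e, cols, w)
        candidates.append(best)
    # Step 2: only these f + 1 hypotheses are evaluated on the hold-out set.
    scored = [(error_rate(w, X_hold[:, cols], y_hold), cols)
              for _, cols, w in candidates]
    return min(scored, key=lambda pair: pair[0])
```

The point of the design is visible in step 2: the hold-out data is consulted only $f + 1$ times, whereas a wrapper that scores every subset on the hold-out set consults it $2^f$ times.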

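Theorem 3 Given $L$, $S$, $\gamma$, the hypothesis $h$ output by
ORDERED-FS will, with high probability, have generalization
error bounded by

$$\varepsilon(h) \le \min_{0 \le r \le f}\left\{\varepsilon_g(r) + O\left(\sqrt{\frac{r_{vc}}{(1-\gamma)m}\left(\log\frac{(1-\gamma)m}{r_{vc}} + 1\right)}\right) + O\left(\sqrt{\frac{r \log f}{(1-\gamma)m}}\right)\right\} + O\left(\sqrt{\frac{\log f}{\gamma m}}\right)$$
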

Proof (Sketch): The first square-root term is simply
the universal estimation rate bound as before, used
to bound the additional error when training on any
fixed feature set. For this to hold with probability
$1 - \delta$, there is also an additive $(1/m)\log(1/\delta)$ within
the square-root. Now, for any fixed $r$, we want to
uniformly bound the deviation of training error from
generalization error for all $\binom{f}{r}$ hypotheses that use
exactly $r$ features. Taking a standard union bound
(see [Vapnik, 1982]), we replace $(1/m)\log(1/\delta)$ with
$(1/m)\log\left(\binom{f}{r}/\delta\right)$, which (noting $\log\binom{f}{r} \le r \log f$)
gives the second term. Lastly, the third term comes,
using a standard Chernoff-bound argument as before,
from testing $O(f)$ hypotheses on the hold-out set of
size $\gamma m$. □

Notice that, similar to STANDARD-WRAP, we have
not explicitly addressed the NP-hard search problem
for the optimal (here in the minimum training error
sense) set of $r$ features, and actual implementations
of ORDERED-FS will generally have to rely on heuristic
search. But for now, let us set this computational
issue aside and treat it as we treated
STANDARD-WRAP, appealing to the same approximations/idealizations
as before. In a rigorous sense to be discussed later, the extent to
which an approximation algorithm can solve the optimization
is exactly the extent to which its error bound
will reach the bound we give here, so
our bound can as before be interpreted, in a formal
sense, as being exactly what a heuristic search implementation
is trying to attain. (In considering heuristic
search implementations, it is also worth mentioning
that searching to minimize training error is probably
often somewhat easier than searching to minimize
hold-out error, which STANDARD-WRAP requires;
for example, in linear regression, we have fast algorithms
for simultaneously evaluating training error for
all single-feature changes to a feature subset.) This
bound is also easily generalized to learning from noisy
examples (again by replacing all occurrences of $m$ in
any denominator term with $(1 - 2\eta)^2 m$).

In any case, the key point of this bound is then the
following: The dependence of our bound on $f$ is only
logarithmic. It is also easy to see from the bound
that the sample complexity $m$ is also logarithmic in $f$.
As discussed in the Introduction, this means that, from
an information-theoretic point of view, one may square
the number of features (for example by adding all
cross-terms between all features), and expect to need
only twice as much training data. We believe that this,
if even only approximately realizable by search algorithms,
may have tremendous consequences for feature
design: modulo computational expense, overly
careful human design of features would often be unnecessary,
so long as additional training data can be
obtained reasonably cheaply.
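
The "square the features, double the data" remark follows from a one-line computation. As a sketch, take the feature-dependent term $\sqrt{r \log f / ((1-\gamma)m)}$ of the ORDERED-FS bound and ask how large $m$ must be to keep it below a fixed tolerance $\epsilon$ (the tolerance $\epsilon$ is introduced here only for illustration):

```latex
\sqrt{\frac{r \log f}{(1-\gamma)m}} \le \epsilon
  \iff
  m \ge \frac{r \log f}{(1-\gamma)\epsilon^2},
\qquad\text{and}\qquad
\frac{r \log\left(f^2\right)}{(1-\gamma)\epsilon^2}
  = 2 \cdot \frac{r \log f}{(1-\gamma)\epsilon^2}.
```

The hold-out term $\sqrt{(\log f)/\gamma m}$ behaves the same way, and the $r_{vc}$ term does not depend on $f$ at all, so replacing $f$ by $f^2$ at most doubles the number of examples needed under this bound.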

To close this section, we informally restate our theoretical
results in terms of upper bounds on sample
complexity, if the target concept is well represented by
some small number $r^*$ of features. That is, we want
the number $m^*$ of examples required so that generalization
error will be close to that of the optimal hypothesis
that uses $r^*$ features. (Slightly more formally,
we want, for any fixed $\epsilon > 0$, that $\varepsilon(h) \le \varepsilon_g(r^*) + \epsilon$
with high probability, and where dependence of $m^*$ on
$\epsilon$ will again be hidden by the $O(\cdot)$ notation.) From
the earlier theorems, it is not difficult to derive the
following (upper bounds on) sample complexity:


algorithm               $m^*$

No feature selection    $O(f_{vc})$
STANDARD-WRAP           $O(r^*_{vc} + f^{\alpha})$, $\alpha \le 1$
ORDERED-FS              $O(r^*_{vc} + r^* \log f)$

Particularly if $r_{vc}$ grows superlinearly in $r$, we easily
see STANDARD-WRAP has a significantly smaller sample
complexity than not performing feature selection
if $r^* \ll f$. This appears to us to be rather strong theoretical
justification for performing feature selection,
thereby answering the question of “why feature selection”
raised in the Introduction. Also, when $r^* \ll f$,
ORDERED-FS, which has sample complexity logarithmic
in $f$, is likely to learn with many fewer training
examples than STANDARD-WRAP.


4  Experimental  Results 

Our  theoretical  results  predicted  ORDERED-FS  to  be 
much  more  tolerant  to  having  a  large  number  of  ir¬ 
relevant  features  than  STANDARD-WRAP.  To  test  this 
hypothesis,  we  ran  both  algorithms  on  a  small,  artifi¬ 
cial  feature  selection  problem. 

The learning algorithm used was logistic regression
[McCullagh and Nelder, 1989], used to fit a linear
discriminant function, and which, while not minimizing
training error, approximates that reasonably. The
input space was $X = \mathbb{R}^f$, and the first target concept
$c$ we used had only one relevant feature:

$$c(x) = \begin{cases} 1 & \text{if } x_1 + 0.2 > 0 \\ 0 & \text{otherwise} \end{cases}$$

Training examples were corrupted at a noise rate
$\eta = 0.3$, and all input features were i.i.d. zero-mean
unit variance normally distributed random variables.
The search heuristic was beam search/forward search
(starting  out  with  the  empty  set  of  features,  and  in¬ 
crementally  adding  features  until  we  have  the  full  set 
of  features).  Forward  search  is  a  popular  choice  that 
appears  to  usually  do  well  [Miller,  1990],  and  beam 
search,  with  a  beam  width  of  50  in  our  case,  should 
be  a  strict  improvement.  (Notice  also  that,  while 
ORDERED-FS  was  originally  formulated  as  consisting 
of $f + 1$ separate searches, it is probably most naturally
implemented as carrying out all the searches
“together”;  our  beam  search  implementation,  which 
starts  from  zero  features  and  incrementally  considers 
higher  numbers  of  features,  is  one  example  of  such.) 
Unlike many “real life” problems, all of our input features
are independent, and so there were, for example,
no complicated interactions between them that could
complicate the search procedure. For STANDARD-WRAP,
we are searching for a feature set $F$ so that training
on $S'|_F$ would give low hold-out error. For ORDERED-FS,
we are searching, for each $r$, for a feature set $F$ of
size $r$ so that training on $S'|_F$ gives low training error.
In  the  rest  of  this  section,  we  will  not  distinguish  be¬ 
tween  the  “idealized”  versions  of  these  two  algorithms 
and  the  approximate  versions  of  the  algorithms.  All 
experimental  results  reported  here  are  averages  of  200 
independent  trials. 
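
The synthetic problem described above is easy to reproduce. The sketch below generates one dataset for the first target concept and splits it into training and hold-out parts; the function names `make_dataset` and `split_holdout`, the seed handling, and the choice to take the hold-out set as the last $\gamma m$ examples are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def make_dataset(m, f, eta=0.3, seed=0):
    """Draw m examples with f i.i.d. N(0, 1) features; only feature x_1 is relevant."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, f))
    clean = (X[:, 0] + 0.2 > 0).astype(int)   # target concept: 1 if x_1 + 0.2 > 0
    flip = rng.random(m) < eta                 # each label flipped with probability eta
    y = np.where(flip, 1 - clean, clean)
    return X, y, clean


def split_holdout(X, y, gamma=0.3):
    """Split into S' (first (1 - gamma) m examples) and S'' (last gamma m examples)."""
    n_hold = int(round(gamma * len(y)))
    n_train = len(y) - n_hold
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```

Because the examples are i.i.d., taking a suffix as the hold-out set is equivalent in distribution to a random split.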

For both algorithms, the hold-out fraction $\gamma$ is a
parameter that had to be chosen. The analysis
of [Kearns, 1996] suggests that, for a wide range of
hold-out testing applications, $\gamma \approx 0.3$ is a good choice
(though it is unclear STANDARD-WRAP would fall into
his framework). Using this as an initial choice for $\gamma$,
we obtain Figure 1, as we vary the total number of
features.  We  see  from  the  graph  that  ORDERED-FS  is 
performing  significantly  better  on  this  domain.  For 
reference,  the  performance  of  learning  without  feature 
selection,  using  all  the  features  and  not  saving  any 
data  for  hold-out  testing,  has  also  been  plotted;  for 
this  problem,  this  is  not  really  a  competitive  algorithm 
(and  it  is  only  very  slightly  competitive  on  the  other 
target  concept  we  test),  and  we  omit  it  from  the  rest 
of  our  graphs. 

Earlier, our bound had predicted that as $f$ increases,
the dominating factor for the error of STANDARD-WRAP
comes from testing $2^f$ hypotheses on $\gamma m$ hold-out
samples, thereby possibly “overfitting” the hold-out
data. For STANDARD-WRAP, it is therefore natural
to see if increasing the hold-out fraction $\gamma$ might alleviate
this effect. Doing so, we obtain Figure 2, which
shows results for STANDARD-WRAP using $\gamma = 0.3$, $0.5$,
and $0.7$. While still inferior to ORDERED-FS, the choice
of $\gamma = 0.5$ does appear to give better performance for
large $f$, and for the remainder of our experimental results,
we report results using STANDARD-WRAP with
$\gamma = 0.3$ and $0.5$.


[Figure 1 (plot title: 1 relevant feature, 30% noise): ... training on all the data (dot), of STANDARD-WRAP (dash) with $\gamma = 0.3$ and ORDERED-FS (solid) with $\gamma = 0.3$. Vertical dashes are 1 s.e.]

[Figure 2: ... $\gamma = 0.7$ (dot). Vertical dashes are 1 s.e.]


Next,  as  we  vary  m,  keeping  the  total  number  of  fea¬ 
tures  at  20,  Figure  3  shows  ORDERED-FS  still  consis¬ 
tently  beating  STANDARD-WRAP.  Lastly,  performing 
similar  experiments  with  a  new  target  function,  this 
time  with  3  relevant  features 

$$c(x) = \begin{cases} 1 & \text{if } x_1 + x_2 + x_3 > 0 \\ 0 & \text{otherwise} \end{cases}$$

we  obtain  Figures  4  and  5,  which  both  show  ORDERED- 
FS  performing  significantly  better. 


[Figure 3 (plot title: 1 relevant feature, 30% noise): ... STANDARD-WRAP with $\gamma = 0.3$ (dash) and $\gamma = 0.5$ (dot-dash), and ORDERED-FS with $\gamma = 0.3$ (solid).]

[Figure 4 (plot title: 3 relevant features, 30% noise): ... ORDERED-FS. Target has 3 relevant features. (Same legend as Figure 3.)]

[Figure 5 (plot title: 3 relevant features, 30% noise): ... ORDERED-FS. Target has 3 relevant features.]

5  Discussion and Conclusions

Our experimental results showed our heuristic-search
version of ORDERED-FS generally beating that of
STANDARD-WRAP. Of course, we do not claim that this
will always be the case; indeed, a more detailed analysis
than we had given suggests STANDARD-WRAP might
do slightly better than ORDERED-FS when the number
of relevant features is large, for example if $r \approx 0.5f$.
(But then, this is often also the case when feature selection
is less useful, compared to learning on the entire
set of features.)

Throughout the paper, we skirted the issue of computational
expense in (approximately) finding the best
(in the training or hold-out error sense) set of features.
Indeed, we believe that much work remains to
be done in this field, perhaps particularly in designing
algorithms for finding feature subsets that minimize
training error such as ORDERED-FS requires; for
example, we have very efficient algorithms for performing
forward and backward search for linear regression
[Miller, 1990], but few generalizations or fast approximations
thereof to other algorithms. Moreover,
for our bounds to predict actual performance well on
real problems, we have to rely on these heuristics to
perform well, though rigorous bounds for performance
using search heuristics can also be given if we can
bound how well the heuristic performs the required
search/optimization. In particular, if a heuristic approximation
to STANDARD-WRAP finds only a feature subset
that comes within $\varepsilon_+$ of minimizing hold-out
error, then a rigorous bound for its generalization error
is the same as for STANDARD-WRAP with an additional
$\varepsilon_+$ term. For ORDERED-FS, if for each value
of $r$, we succeed in finding only a feature subset that
comes within $\varepsilon_+(r)$ of minimizing training error over
all feature subsets of size $r$, then a rigorous bound for
generalization error is the same as for ORDERED-FS but
with an additional $\varepsilon_+(r)$ term in the $\{\}$ curly brackets.
(We defer proofs and a more detailed discussion
of implications to the full paper.) Nevertheless, search
heuristics are then immediately seen to be trying to
drive $\varepsilon_+$ or $\varepsilon_+(r)$ to zero, and can therefore be argued
to be trying to reach the performance suggested by
our bounds. (However, one other surprising effect not
modeled by our bounds and which deserves mention is
that when STANDARD-WRAP is “badly” overfitting the
hold-out data, then our earlier work suggests that even
randomly throwing some subset of the $2^f$ hypotheses
away may improve performance [Ng, 1997]. This suggests
that in such somewhat-degenerate cases, using a
weaker search heuristic may actually be helpful. In our
experiments, we did manage to find parameter ranges
that seemed to exhibit this effect; but we do not know
how prevalent this effect is in practice, and would of
course recommend using a good optimization criterion,
like that of ORDERED-FS, rather than using a less-sound
criterion and then trying to do a poor job of optimizing
it.)

Finally,  using  techniques  similar  to  those  used  in  this 
paper,  it  is  possible  to  derive  other  algorithms  or  mod¬ 
ified  versions  of  our  algorithm  that,  like  ORDERED-FS, 
have  strong  theoretical  properties  regarding  tolerance 
to  the  presence  of  many  irrelevant  features,  and  which 
may  have  slightly  different  strengths  and  weaknesses 
than  ORDERED-FS;  and  we  discuss  a  number  of  them 
in  detail  in  the  full  paper.  But  for  now,  a  significant 
result  of  this  work  is  that  with  appropriate  feature  se¬ 
lection,  sample  complexity  becomes  logarithmic  in  the 
number  of  irrelevant  features,  so  that  we  can  handle 
exponentially many irrelevant features as training examples.
Of course, we still have to rely on search heuristics
to help us reach these bounds, and while much empirical
work remains to be done evaluating ORDERED-FS
and comparing it to STANDARD-WRAP and possible
interpolations  between  the  two  algorithms,  we  also  be¬ 
lieve  that  being  able  to  give  these  bounds  is  very  en¬ 
couraging,  because  it  means  that  if  they  are  even  only 
approximately  realizable  by  search  algorithms,  they 
may  have  tremendous  consequences  for  feature  design 
-  that  modulo  computational  expense,  overly  careful 
human  design  of  features  may  often  be  unnecessary, 
so  long  as  additional  training  data  can  be  obtained 
reasonably  cheaply. 

Abstract

This paper addresses the problem of using decision
lists for building machine learning algorithms.
In this work, we first highlight the
expressive  power  of  decision  lists,  which  were 
already  known  to  generalize  decision  trees. 
We  also  present  ICDL,  a  new  algorithm  for 
learning simple Decision Lists. This problem
(learning small, highly accurate lists) is,
as  we  prove  formally,  theoretically  hard  and 
calls  for  the  use  of  heuristics  such  as  CN2, 
BruteDL  or  ICDL.  Our  method  is  based  on 
an  original  technique  midway  between  learn¬ 
ing  rule  based  procedures  and  decision  trees. 
ICDL operates in two stages: it first greedily
builds a large decision list, then prunes it
to  obtain  a  smaller  yet  accurate  one,  thereby 
avoiding  the  drawbacks  associated  with  the 
first phase alone. Experimental results show
the efficiency of our approach by comparing
it with the two well-known algorithms
CN2 and C4.5. ICDL’s time complexity is
low.  It  produces  decision  lists  whose  size  is 
far  smaller  compared  to  both  CN2  and  C4.5, 
and  whose  accuracy  also  compares  favourably 
with  theirs. 


1  Introduction 

A  Decision  List  (DL)  is  an  ordered  list  of  conjunc¬ 
tive  rules  [Riv87].  It  classifies  examples  by  assigning 
to  each  the  class  associated  with  the  first  rule  the  ex¬ 
ample  triggers.  Decision  lists  were  first  introduced  by 
[Riv87],  and  shown  to  be  very  expressive.  A  moti¬ 
vation  for  the  study  of  decision  lists  was  their  rela¬ 
tionships  with  decision  trees,  which  are  widely  used 


as  concept  representations  in  state-of-the-art  machine 
learning  algorithms  such  as  CART  [BFOS84]  or  C4.5 
[Qui93].  More  precisely,  [Riv87]  proved  that  decision 
lists  generalize  decision  trees,  which  proves  their  ex¬ 
pressive  power.  In  this  paper  we  first  give  further  in¬ 
sight  on  this  property.  We  show  that  while  it  strictly 
generalizes  decision  trees,  the  decision  list  formalism 
can  be  used  to  capture  the  expressive  power  of  deci¬ 
sion  committees  [NG95],  a  class  generalizing  multilin¬ 
ear  polynomials  [Noc98].  Moreover,  decision  lists,  un¬ 
like  decision  trees,  represent  classification  procedures 
based  on  rules  [CN89],  and  not  on  some  ordering  of 
variables. These two properties make decision lists desirable
for mining sets of examples or databases, analyzing
their information, and improving prediction accuracy.
These goals are important as many organizations
tend  to  have  massive  amounts  of  data,  which  they  need 
to  understand,  interpret  and  extrapolate  [KSD96]. 

The  work  of  [SE94,  KSD96,  Koh95,  CN89,  CB91] 
shows  that  any  machine  learning  algorithm  should 
meet  four  essential  requirements  to  be  of  practical  use: 

1.  Accurate  classification.  The  induced  decision  list 
should  be  able  to  classify  new  examples  accu¬ 
rately. 

2.  Noise  handling.  The  algorithm  should  work  even 
in  domains  where  there  might  be  noise. 

3.  Simple decision lists. For the sake of interpretability,
the induced decision list should be as simple
as possible. This constraint conflicts with the accuracy
constraint. Generally, satisfying both implies
finding a good tradeoff between simplicity
and goodness-of-fit [NG95]. In practice this implies
relaxing the goal of finding a decision list
consistent with the dataset used to build it.

4.  Efficient rule generation. In order to handle large
datasets,  the  algorithm  must  be  fast.  [CN89]  ar¬ 
gue  that  the  time  taken  to  generate  one  rule  in 
the  decision  list  should  be  linear  in  the  size  of  the 
dataset  used  to  build  the  decision  list. 

Very few theoretical results are available that establish
whether or not the four quality requirements stated above
can be met. It is not even known whether finding simple
and  accurate  decision  lists  is  feasible  within  a  reason¬ 
able  time  [dCGG94].  Yet  a  positive  result  would  prove 
the  existence  of  efficient  algorithms  for  this  task,  and 
a  negative  result  would  formally  prescribe  the  use  of 
heuristics  to  meet  these  criteria.  Our  second  contribu¬ 
tion  in  this  paper  is  to  show  that  meeting  the  four  re¬ 
quirements  above  for  decision  lists  is  hard.  As  has  pre¬ 
viously  been  done  for  decision  trees  [HR76],  we  prove 
that  finding  the  smallest  decision  list  consistent  with 
a set of examples is NP-hard, for various notions of
size. Keeping in mind the four requirements mentioned
above, this justifies the heuristic character of
algorithms  such  as  CN2  [CN89],  BruteDL  [SE94],  SDL 
[dCGG94],  or  ICDL,  the  algorithm  we  propose  below. 

There  are  at  least  three  categories  of  algorithms  that 
learn  decision  lists,  each  following  a  different  construc¬ 
tion  method.  The  first  are  greedy,  iterative  algorithms 
such  as  CN2  [CN89]  which  add  rules  one-at-a-time  in 
the  decision  list  according  to  a  quality  criterion.  When 
a  rule  is  built,  the  examples  it  covers  are  removed  from 
the  training  set.  The  process  is  repeated  until  a  stop 
criterion is satisfied (e.g. the dataset is exhausted).
The  second  are  based  on  a  search  of  the  rule  space 
to  find  a  set  of  good  rules,  before  putting  them  into 
a  decision  list,  in  the  same  way  as  BruteDL  [SE94]. 
The  search  is  based  on  a  branch-and-bound  algorithm 
and  proceeds  by  specializing  iteratively  a  set  of  rules 
initialized  to  the  empty  set.  The  aim  is  to  find  all 
the  most  general  homogeneous  rules,  that  is,  those 
rules whose accuracy does not change when specialized.
The third are based on a stochastic search of the
decision  list  space,  as  SDL  does  with  simulated  anneal¬ 
ing  [dCGG94]. 

However,  practical  shortcomings  have  been  observed 
in  all  these  families.  As  pointed  out  by  [SE94],  al¬ 
gorithms  such  as  CN2  suffer  from  the  rule  overlap 
problem.  When  large  decision  lists  are  greedily  con¬ 
structed,  the  last  rules  in  the  decision  list  are  built 
using  very  few  examples,  a  situation  which  has  two 
consequences;  these  rules  are  difficult  to  comprehend 


and they may exhibit low accuracy w.r.t. new examples.
The problem with algorithms such as BruteDL
is that the search of the rule space can take too
long: this requires the use of thresholds to limit the
search  [SE94],  and  eventually  leads  to  the  construc¬ 
tion  of  very  large  decision  lists.  Finally,  algorithms 
such  as  SDL  suffer  the  problem  of  stochastic  search 
convergence,  making  it  necessary  to  run  the  algorithm 
for  long  periods,  a  situation  which  makes  them  poten¬ 
tially  time  consuming. 

Our  third  contribution  in  this  paper  is  a  new  algo¬ 
rithm  for  the  induction  of  short  decision  lists.  ICDL, 
which  stands  for  “Induction  of  CART-based  Decision 
Lists”,  is  a  two-stage  heuristic.  First,  it  proceeds  by 
building  rules  iteratively  and  greedily  using  a  proce¬ 
dure  inspired  from  decision  tree  induction.  ICDL  then 
prunes  the  decision  list  using  a  CART-like  criterion  to 
obtain a shorter decision list used for testing. ICDL
thus adapts to the decision list formalism approaches
that have previously proved successful for building
decision trees.

The  rest  of  this  paper  is  organized  as  follows.  First  we 
give  results  completing  those  of  [Riv87]  on  the  expres¬ 
sive  power  of  decision  lists.  Then  we  give  formal  proof 
of  the  hardness  of  building  small  and  accurate  deci¬ 
sion  lists.  In  the  following  section  we  present  ICDL, 
adducing  experimental  results  which  prove  the  valid¬ 
ity  of  our  approach  w.r.t.  the  four  introductory  re¬ 
quirements.  We  then  compare  our  results  with  those 
obtained  using  CN2  and  C4.5. 

2  Expressive  power  of  Decision  Lists 

Following  [SE94],  we  let  E  denote  the  universe  of  ex¬ 
amples,  each  of  which  is  described  using  n  variables. 
In  this  section,  we  consider  for  simplicity  that  each 
variable is Boolean. Given a set of $n$ Boolean variables,
we let $\{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_n, \bar{x}_n\}$ denote the set
of corresponding literals. We suppose that examples
are classified according to a set of goal classes denoted
$G$. We write a rule as $t \to g$, where $g \in G$,
and $t$ is a non-empty monomial, that is, a conjunction
of literals. We write a decision list with $k$ rules as
$(t_1 \to g_1), (t_2 \to g_2), \ldots, (t_k \to g_k), g_{k+1}$, where the
class $g_{k+1}$ is called the default class. The class associated
with any example $e \in E$ is the goal class corresponding
to the first monomial satisfied by the example.
If none is satisfied, the example is assigned the
default class. $\forall\, 0 < k < n$, we let $k$-DL denote the
set of decision lists whose monomials have at most $k$
literals. Let DT stand for the class of binary decision
trees, as described e.g. in [Riv87]. Let $k$-DT be the
set of decision trees having depth at most $k$, where the
depth is the size of the longest path from the root to a leaf of
the tree. In his seminal article on DL, [Riv87] proved
that decision lists generalize decision trees, by studying
the inclusion relationships between classes $k$-DL
and $k$-DT. More formally, we have

Theorem 1 [Riv87] $\forall\, 0 < k < n$, $k$-DT $\subset$ $k$-DL.
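
The first-match semantics of a decision list defined above can be sketched in a few lines. The representation below is an assumption made for illustration: a literal is a pair `(index, required_value)`, a monomial is a list of literals, and a rule is a pair `(monomial, goal_class)`. The names `satisfies` and `classify_dl` and the example list are hypothetical.

```python
def satisfies(example, monomial):
    """True if the Boolean example agrees with every literal of the monomial."""
    return all(example[i] == value for i, value in monomial)


def classify_dl(example, rules, default_class):
    """Return the class of the first rule whose monomial the example satisfies."""
    for monomial, goal_class in rules:
        if satisfies(example, monomial):
            return goal_class
    return default_class  # no rule fired


# A hypothetical 2-DL over three Boolean variables (0-based indices):
rules = [
    ([(0, 1), (1, 0)], "a"),  # x1 AND (NOT x2) -> a
    ([(2, 1)], "b"),          # x3 -> b
]
print(classify_dl((1, 0, 1), rules, "c"))  # both rules match; the first one wins
```

The order of the rules matters: the example `(1, 0, 1)` satisfies both monomials, and only the position of the rules in the list decides its class.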

[NG95] have presented a new class of Boolean formalism
that makes it possible to pin down the real power of decision lists:
the decision committees. A decision committee contains
two parts. The first is a set of unordered couples (or rules)
$(t_i, v_i)$, where each $t_i$ is a monomial over $\{0, 1, *\}^n$
and each $v_i$ is a vector with one component per class (in the two-class case,
it is sufficient to use a single number rather than a
2-component vector). The second is a Default Vector
$D$ whose components lie in $[0, 1]$. Again, in the two-class case,
$D$ can be replaced by a default class.

The classification of any example $e \in E$ is made by
summing in a vector $V_e$ the vectors of each rule $e$ satisfies.
If the maximal component of $V_e$ is unique then
its index gives the class assigned to $e$. Otherwise, we
take the index of the maximal component of $D$ among those
corresponding to the maximal components of $V_e$. Let DC
stand for the whole class of decision committees. Define,
$\forall k < n$, $k$-DC to be the subclass of DC where each
element has monomials of length at most $k$. The following
theorem shows that decision committees are at least
as expressive as decision lists:

Theorem 2 [NG95] $\forall\, 0 < k < n$ constant, $k$-DL $\subset$ $k$-DC.
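The voting rule of a decision committee described above can be sketched as follows. The string encoding of monomials over {0,1,*}, the integer vote vectors and the example committee are assumptions chosen for illustration; the paper does not prescribe a data structure.

```python
from typing import List, Sequence, Tuple

# Assumed encoding: a monomial is a string over "0", "1", "*" (one character
# per variable); each rule carries a vote vector with one entry per class.
CommitteeRule = Tuple[str, List[float]]


def satisfies(example: Sequence[int], monomial: str) -> bool:
    """'*' matches any value; '0'/'1' must equal the example's bit."""
    return all(m == "*" or int(m) == bit for m, bit in zip(monomial, example))


def dc_classify(example: Sequence[int],
                rules: List[CommitteeRule],
                default_vector: List[float]) -> int:
    """Sum the vectors of satisfied rules; break ties with the default vector."""
    totals = [0.0] * len(default_vector)
    for monomial, vector in rules:
        if satisfies(example, monomial):
            totals = [a + b for a, b in zip(totals, vector)]
    best = max(totals)
    tied = [i for i, v in enumerate(totals) if v == best]
    if len(tied) == 1:
        return tied[0]
    # Several maximal components: pick the one with the largest default weight.
    return max(tied, key=lambda i: default_vector[i])


# Hypothetical two-class committee over two variables.
rules = [("1*", [1, 0]), ("*1", [0, 1])]
default_vector = [0.6, 0.4]
print(dc_classify((1, 1), rules, default_vector))
```

Unlike in a decision list, the order of the rules plays no role: every satisfied rule contributes to the vote.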

Although Theorem 1 still holds for non-constant values of $k$, we have
$(n-1)$-DL $=$ $(n-1)$-DC [Noc98]. This states that there exists a value
$k'$ such that $\forall\, k > k'$, the classes $k$-DL and $k$-DC coincide,
although $k$-DL still strictly contains $k$-DT. Apart from the fact that
decision lists are rule-based procedures, unlike decision trees, this
result is another reason to advocate the use of decision lists instead of
decision trees in machine learning algorithms.

3  Learning  small  accurate  DLs 

In the introduction, we presented four requirements that any efficient
learning algorithm should try to meet. Previous studies on decision trees
[HR76] and decision committees [NG95] have established that they are hard
to satisfy for machine learning algorithms using these respective
classes. These results partly justify the heuristic nature of algorithms
such as CART [BFOS84], C4.5 [Qui93], or IDC [NG95]. We now prove that this
aim is also intractable for DL, and thereby offer a positive answer to a
conjecture of [dCGG94]. We define the size of a DL as the total number of
literal occurrences in the decision list (if a literal appears $k$ times,
it is counted $k$ times). This definition is very close to the one used in
[Ris78], which is the smallest number of bits needed to write down a
procedure, given an optimal encoding.

Theorem  3  It  is  NP-Hard  to  find  the  smallest  deci¬ 
sion  list  consistent  with  a  set  of  examples  LS. 

Proof: We use a reduction from the NP-Hard “Minimum Cover” problem [CJ79]:

• Name: “Minimum Cover”.

• Instance: A collection $C$ of subsets of a finite set $S$. A positive
integer $K$, $K \le \mathrm{Card}(C)$, where Card(.) denotes the
cardinality.

• Question: Does $C$ contain a cover of size at most $K$, that is, a
subset $C' \subseteq C$ with $\mathrm{Card}(C') \le K$, such that any
element of $S$ belongs to at least one member of $C'$?

The reduction is constructed as follows: from a “Minimum Cover” instance
we build a learning sample $LS$ such that if there exists a cover $C'$ of
$S$ of size $\mathrm{Card}(C') \le K$, then there exists a decision list
with $\mathrm{Card}(C')$ literals consistent with $LS$, and, reciprocally,
if there exists a decision list with $k$ literals consistent with $LS$,
then there exists a cover of $S$ of size at most $k$. Hence, finding the
smallest decision list consistent with $LS$ is equivalent to finding the
smallest $K$ for which there exists a solution to “Minimum Cover”, and
this is intractable if P $\ne$ NP.

Let $c_j$ denote the $j$-th element of $C$, and $s_j$ the $j$-th element
of $S$. We define a set of $\mathrm{Card}(C)$ Boolean variables
$\{v_1, v_2, \ldots, v_{\mathrm{Card}(C)}\}$, in one-to-one correspondence
with the elements of $C$, which we use to describe the examples of $LS$.
The corresponding set of literals is denoted
$\{x_1, \bar{x}_1, x_2, \bar{x}_2, \ldots, x_{\mathrm{Card}(C)}, \bar{x}_{\mathrm{Card}(C)}\}$.
Our reduction uses two classes, one positive and one negative,
respectively denoted by “1” and “0”. The sample $LS$ contains two disjoint
subsets: the set of positive examples $LS^+$, and the set of negative ones
$LS^-$. $LS^+$ contains $\mathrm{Card}(S)$ examples, denoted by
$\{e_1^+, e_2^+, \ldots, e_{\mathrm{Card}(S)}^+\}$. We construct each
positive example so that it encodes the membership of the corresponding
element of $S$ in the subsets of $S$ present in $C$. More precisely,
$\forall\, 1 \le i \le \mathrm{Card}(S)$,

$$e_i^+ = \bigwedge_{j :\, s_i \in c_j} x_j \;\wedge\; \bigwedge_{j :\, s_i \notin c_j} \bar{x}_j .$$

$LS^-$ contains a single negative example, defined by:

$$e^- = \bigwedge_{j=1}^{\mathrm{Card}(C)} \bar{x}_j .$$

• Suppose there exists a cover $C'$ of $S$ satisfying
$\mathrm{Card}(C') \le K$. We create a single-rule decision list:
$(t \to 0), 1$, where $t = \bigwedge_{c_j \in C'} \bar{x}_j$. Since any
element of $S$ belongs to an element of $C'$, no positive example
satisfies the monomial $t$. Thus all positive examples are correctly
classified, as well as the negative example, which satisfies $t$. This
decision list contains $\mathrm{Card}(C')$ literals, and is consistent
with $LS$.

• Suppose there exists a DL $h$ with $k$ literals consistent with $LS$.
Write it $(t_1 \to g_1), (t_2 \to g_2), \ldots, (t_{k'} \to g_{k'}), g_{k'+1}$,
with $k' \le k$. The three properties below hold because $h$ is
consistent.

[P1] Since there exists only one negative example, we can suppose without
loss of generality that only one goal class is negative. This is either
the default class, if the negative example satisfies no monomial inside
the decision list, or the goal class of the first rule whose monomial is
satisfied by the negative example. Any other goal class is positive. Let
$g_l$ denote the goal class (or the default class) which is negative,
where $l$ is an integer from the set $\{1, 2, \ldots, k'+1\}$.

[P2] Any rule preceding $g_l$ in $h$ contains a literal involving an
equality comparison to “1”, that is, a positive literal. Otherwise, the
negative example would satisfy the corresponding monomial, thereby being
incorrectly classified.

[P3] If $g_l$ is not the default class, monomial $t_l$ is not empty and
contains only negative literals. Otherwise, the negative example would
not satisfy the monomial.

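We now create two subsets of $C$, namely $C'_1$ and $C'_2$, whose union
$C' = C'_1 \cup C'_2$ is a cover of $S$ with at most $k$ subsets.

$$C'_1 = \begin{cases} \emptyset & \text{if } l = 1 \\ \{c_i : \exists\, 1 \le j < l,\ x_i \in t_j\} & \text{if } l > 1 \end{cases}$$

If $g_l$ is not the first goal class, $C'_1$ contains the subsets whose
indices are those of the positive literals appearing in the rules
$t_j \to g_j$ with $j < l$.

$$C'_2 = \begin{cases} \emptyset & \text{if } l = k'+1 \\ \{c_i : \bar{x}_i \in t_l\} & \text{if } l < k'+1 \end{cases}$$

If $g_l$ is not the default class, $C'_2$ contains the subsets whose
indices are those of the variables appearing in $t_l$.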

At least one of the two subsets is not empty. Otherwise, we would have
$l = 1 = k'+1$, leading to $k' = 0$: the decision list would consist of a
single default class for all negative and positive examples, which is
impossible.

Because the DL is consistent, any positive example either satisfies a
monomial before $t_l$, or does not satisfy monomial $t_l$. In the first
case, property [P2] implies that the example has some positive literal in
common with that monomial; therefore, one element of $C'_1$ contains the
element of $S$ from which the positive example was created. In the second
case, if the positive example does not satisfy monomial $t_l$, then by
property [P3] there is in $t_l$ at least one negative literal it does not
satisfy. To this negative literal in $t_l$ corresponds a positive one in
the example, and therefore there exists an element of $C'_2$ containing
the element of $S$ from which the positive example was created. The union
$C'_1 \cup C'_2 = C'$ contains at most $k$ elements, and is a cover of
$S$. This completes the proof. □
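The first direction of the reduction can be checked on a toy instance with the sketch below. It assumes the encoding suggested by the proof (variable $j$ of a positive example is 1 exactly when the corresponding element of $S$ belongs to $c_j$, and the negative example is all zeros); the function names and the instance are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Example = Tuple[int, ...]


def build_sample(universe: Sequence[str],
                 subsets: Sequence[FrozenSet[str]]) -> Tuple[List[Example], Example]:
    """One positive example per element of S, plus the all-zero negative example."""
    positives = [tuple(int(s in c) for c in subsets) for s in universe]
    negative = tuple(0 for _ in subsets)
    return positives, negative


def cover_to_monomial(cover_indices: Iterable[int]) -> List[int]:
    """Monomial t as the list of indices j whose negative literal appears in t."""
    return sorted(cover_indices)


def classify(example: Example, monomial: List[int]) -> int:
    """Single-rule decision list (t -> 0), 1."""
    return 0 if all(example[j] == 0 for j in monomial) else 1


def is_consistent(positives: List[Example], negative: Example,
                  monomial: List[int]) -> bool:
    """True when the negative example gets class 0 and every positive gets 1."""
    return (classify(negative, monomial) == 0
            and all(classify(e, monomial) == 1 for e in positives))


# Hypothetical "Minimum Cover" instance.
S = ["a", "b", "c", "d"]
C = [frozenset("ab"), frozenset("bc"), frozenset("cd"), frozenset("d")]
positives, negative = build_sample(S, C)

cover = {0, 2}       # {a,b} and {c,d} cover S
non_cover = {0, 1}   # element d is left uncovered
print(is_consistent(positives, negative, cover_to_monomial(cover)))
print(is_consistent(positives, negative, cover_to_monomial(non_cover)))
```

A cover yields a consistent one-rule list with as many literals as subsets in the cover, whereas leaving an element uncovered makes its positive example satisfy the monomial and be misclassified.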

Counting the total number of literals of a decision list is one of the
finest notions of size, since it comes close to that of [Ris78]. However,
the problem of constructing decision lists of limited complexity is also
hard for a relaxed notion of size presented in [KLPV87]: the total number
of rules.

Theorem  4  [Noc98]  It  is  NP-Hard  to  find  the  small¬ 
est  decision  list  consistent  with  a  set  of  examples  LS,  if 
the  notion  of  size  is  the  number  of  rules  of  the  decision 
list. 

4  ICDL 

We now present our two-stage decision list learning algorithm, ICDL. The
first stage consists of the greedy construction of a large decision list,
$dl_{max}$, and the second one prunes $dl_{max}$ to obtain $dl_{end}$, the
decision list used for testing.

4.1 Building $dl_{max}$

Table 1 presents a pseudo-code description of the algorithm used to build
$dl_{max}$, as well as the two procedures it uses, MakeRule() and BestL().
Function Gini() returns the value of the Gini criterion [BFOS84] of a
decision list in the following way. Let
$(t_1 \to g_1), (t_2 \to g_2), \ldots, (t_k \to g_k), g_{k+1}$ denote the
decision list CurrentDL. $\forall\, 1 \le i \le k+1$,
$\forall\, 1 \le j \le \mathrm{Card}(G)$, let $LS_{i,j} \subseteq LS$ stand
for the subset of examples from class $j$ that are classified by goal
class $g_i$ (which is the default class if $i = k+1$);
$\forall\, 1 \le i \le k+1$, define
$LS_i = \bigcup_{j=1}^{\mathrm{Card}(G)} LS_{i,j}$.
$\forall\, 1 \le i \le k+1$, define

$$\mathrm{Gini}(i) = \sum_{j \ne j'} \frac{\mathrm{Card}(LS_{i,j})}{\mathrm{Card}(LS_i)} \times \frac{\mathrm{Card}(LS_{i,j'})}{\mathrm{Card}(LS_i)}$$

The Gini criterion measured for CurrentDL equals

$$\mathrm{Gini}(\mathrm{CurrentDL}) = \sum_{i=1}^{k+1} \frac{\mathrm{Card}(LS_i)}{\mathrm{Card}(LS)} \times \mathrm{Gini}(i)$$

BuildLmax() adds rules at the end of a decision list initialized to the
empty decision list, using the procedure AddLast(). Each rule is
constructed using MakeRule(), which builds a rule by adding literals one
at a time using the procedure BestL(). The best literal returned by
BestL() is the one, if it exists, that satisfies the two following
conditions:

• among all possible additions of a literal to the rule under
construction, it decreases the Gini criterion of the whole decision list
the most, and

• it decreases the value of the Gini criterion compared to its value for
the decision list before addition of the literal.

BuildLmax() stops when no one-literal rule added at the end of the
decision list lowers the value of the Gini criterion.

BuildLmax()
   DecisionList := MakeEmptyDL();
   StopDLConstruction := FALSE;
   WHILE StopDLConstruction = FALSE DO
      CurrentRule := MakeRule(DecisionList);
      IF NotEmpty(CurrentRule) THEN
         AddLast(CurrentRule, DecisionList);
      ELSE StopDLConstruction := TRUE;
   END
END

MakeRule(CurrentDL)
   newRule := MakeEmptyRule();
   StopRuleConstruction := FALSE;
   WHILE StopRuleConstruction = FALSE DO
      LTest := BestL(CurrentDL, newRule);
      IF NotEmpty(LTest) THEN
         AddLiteral(LTest, newRule);
      ELSE StopRuleConstruction := TRUE;
   END
   Return(newRule);
END

BestL(CurrentDL, Rule)
   newDL := CurrentDL;
   newRule := Rule;
   GiniOpt := Gini(CurrentDL);
   optimalL := MakeEmptyLiteral();
   FOR LTest := FirstLTest to LastLTest DO
      AddLiteral(LTest, newRule);
      AddLast(newRule, newDL);
      IF Gini(newDL) < GiniOpt THEN
         optimalL := LTest;
         GiniOpt := Gini(newDL);
      newDL := CurrentDL;
      newRule := Rule;
   END
   Return(optimalL);
END

Table 1: Pseudocode for the building of $dl_{max}$.

PruneDL(DecisionList)
   DLSequence := MakeEmptySequence();
   CurrentDL := DecisionList;
   WHILE NotEmpty(CurrentDL) DO
      DLSequence := DLSequence + CurrentDL;
      CurrentLit := LitToPrune(CurrentDL);
      Prune(CurrentDL, CurrentLit);
   END
   ReturnBestDL(DLSequence);
END

Table 2: Pseudocode for pruning $dl_{max}$ to obtain $dl_{end}$.

4.2 Pruning of $dl_{max}$ to obtain $dl_{end}$

Table 2 presents a pseudo-code description of the algorithm used to prune
$dl_{max}$ to obtain $dl_{end}$. In our experiments, the example set used
to prune is different from the one used to construct $dl_{max}$. At each
step, the literal to be pruned is returned by the procedure LitToPrune(),
and is the literal $\ell$, among all those of CurrentDL, which minimizes
the function PruneValue(CurrentDL, $\ell$):

$$\mathrm{PruneValue}(\mathrm{CurrentDL}, \ell) = \frac{\mathrm{Delta}(\mathrm{CurrentDL}, \ell)}{\mathrm{N}(\mathrm{CurrentDL}, \ell) \times (\mathrm{NDT}(\mathrm{CurrentDL}, \ell) - 1)}$$

Delta(CurrentDL, $\ell$) returns the number of examples that are correctly
classified by CurrentDL and are no longer correctly classified when we
remove literal $\ell$. N(CurrentDL, $\ell$) returns the number of examples
that do not pass any literal preceding $\ell$ in the decision list.
NDT(CurrentDL, $\ell$) returns the product of the sizes of all rules
following the one in which $\ell$ is present. This criterion is analogous
to the pruning criterion used in CART [BFOS84], measured over a decision
tree built from the decision list such that each node in the decision
tree corresponds to one literal in the decision list. We refer the reader
to [BFOS84] for additional details on this criterion. $dl_{end}$ is the DL
which, among all DLs pruned from $dl_{max}$, has the highest accuracy over
the examples used to prune $dl_{max}$.
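To make the construction stage of Section 4.1 concrete, the following sketch computes the Gini criterion of a whole decision list and performs one greedy literal-selection step in the spirit of BestL(). It is an illustration under stated assumptions, not the authors' implementation: literals are encoded as `(variable index, required value)` pairs, the inner sum is taken over ordered pairs of distinct classes (so that Gini(i) equals one minus the sum of squared class proportions), and the baseline is the list containing the rule before the literal is added, following the second condition in the text.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Literal = Tuple[int, int]               # (variable index, required value 0/1)
Monomial = List[Literal]
Example = Tuple[Tuple[int, ...], str]   # (attribute values, class label)


def first_match(x: Sequence[int], monomials: List[Monomial]) -> int:
    """Index of the first satisfied monomial, or len(monomials) for the default."""
    for i, t in enumerate(monomials):
        if all(x[var] == val for var, val in t):
            return i
    return len(monomials)


def gini_dl(monomials: List[Monomial], sample: List[Example]) -> float:
    """Weighted Gini criterion over the k+1 groups LS_i induced by the list."""
    groups = [Counter() for _ in range(len(monomials) + 1)]
    for x, label in sample:
        groups[first_match(x, monomials)][label] += 1
    total = len(sample)
    value = 0.0
    for counts in groups:
        n_i = sum(counts.values())
        if n_i == 0:
            continue  # empty groups contribute nothing
        gini_i = 1.0 - sum((c / n_i) ** 2 for c in counts.values())
        value += (n_i / total) * gini_i
    return value


def best_literal(monomials: List[Monomial], rule: Monomial,
                 sample: List[Example],
                 candidates: List[Literal]) -> Optional[Literal]:
    """Literal whose addition to `rule` (placed last) lowers Gini most, else None."""
    best, best_value = None, gini_dl(monomials + [rule], sample)
    for lit in candidates:
        if lit in rule:
            continue
        value = gini_dl(monomials + [rule + [lit]], sample)
        if value < best_value:
            best, best_value = lit, value
    return best


# Hypothetical sample: the class is fully determined by variable 0.
sample = [((1, 0), "pos"), ((1, 1), "pos"), ((0, 0), "neg"), ((0, 1), "neg")]
candidates = [(0, 1), (0, 0), (1, 1), (1, 0)]
print(gini_dl([], sample), best_literal([], [], sample, candidates))
```

On this sample the empty list has Gini 0.5, a single test on variable 0 brings it to 0, and no further literal can lower it, which is the stopping condition of BuildLmax().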

5  Experimental  results 

ICDL was run on several benchmark problems. Its results are compared to
those of CN2 [CN89] and C4.5 [Qui93]. Table 3 presents the datasets which
were used for comparisons ($c$ is a shorthand for $\mathrm{Card}(G)$).
Dataset references are [CB91] for VO, PC, GL, HH, HC, EC, HP, [TBB+91]
for M1, M2, M3 and [BFOS84] for WO. All datasets, except WO (an artificial
problem generated following [BFOS84]), can be found in the UCI repository
of machine learning databases.

Table 3: Characteristics of Data Sets.

       Card(LS)   c   Test set   Comments
VO     435        2   0          Congress votes
WO     10x300     3   5000       Waveform recognition
M1     124        2   432        MONKS #1
M2     169        2   432        MONKS #2
M3     122        2   432        MONKS #3
PC     1044       2   0          Pole and Cart
GL     214        6   0          Glass recognition
HH     294        2   0          Heart Hungary
HC     303        2   0          Heart Cleveland
EC     131        2   0          Echocardio
HP     157        2   0          Hepatitis

Table 4: ICDL vs CN2.

       ICDL acc    CN2 acc     ICDL size   CN2 size
VO     95.9±1.0    94.8±1.7    3.1±1.6     41.6±8.2
WO     66.5±4.6    65.6±4.3    21.8±6.8    28.6±5.1
M1     83.3        100         6           13
M2     65.1        69          14          145
M3     100         89.1        4           38
PC     70.7±1.5    70.6±3.1    14.8±9.8    133.6±6.3
GL     58.9±6.4    58.5±5.0    12.6±3.9    32.8±3.0
HH     79.4±3.4    75.4±3.6    13.4±6.9    35.1±2.5
HC     79.0±4.2    75.0±3.8    15.5±7.5    40.9±4.0
EC     70.7±4.7    62.3±5.1    4.7±2.4     26.4±4.0
HP     79.6±3.5    77.6±5.9    4.2±2.8     24.0±5.5
Avg.   77.19       76.17       10.37       50.81

Results for CN2 are those of [CB91] (VO, PC, GL, HH, HC, EC, HP),
[TBB+91] (M1, M2, M3) and [dCGG94] (WO).

A learning sample $LS$ is split into two subsets, the first used for
building $dl_{max}$ (2/3 of the examples), and the second one for pruning
to obtain $dl_{end}$ (1/3 of the examples). When only one dataset exists
for learning and testing (which is the case for all datasets except WO,
M1, M2, M3), we proceed by averaging over 10 iterations the result of the
following split-and-build experiment: randomly split the whole sample
into a learning sample (2/3 of the examples) and a test sample (1/3 of
the examples); use the learning sample to construct a decision list with
ICDL, and test it on the test set (ratios follow [CB91]). Therefore, in
that case, 4/9 of the examples are used for building $dl_{max}$, 2/9 are
used for building $dl_{end}$, and 1/3 are used for evaluating the accuracy
of $dl_{end}$. Table 4 presents ICDL’s results compared to CN2’s (“acc”
means accuracy; “size” stands for the total number of literals in the
formula; “Avg.” gives average results); when there is more than one
learning sample (WO), or when split-and-build is used, results are given
in the form “Mean ± Standard deviation”.
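The fractions quoted for the split-and-build protocol follow from applying the 2/3 versus 1/3 split twice, as the short sketch below verifies. The index-splitting helper, its rounding by integer division and the seed are assumptions for illustration only; the paper does not say how fractional sizes were rounded.

```python
import random
from fractions import Fraction
from typing import List, Tuple

LEARNING = Fraction(2, 3)            # learning sample vs. test sample
TEST = 1 - LEARNING
BUILD = LEARNING * Fraction(2, 3)    # part of the learning sample used to build dl_max
PRUNE = LEARNING * Fraction(1, 3)    # part of the learning sample used for pruning


def split_indices(n: int, rng: random.Random) -> Tuple[List[int], List[int], List[int]]:
    """Randomly partition range(n) into build, prune and test index lists."""
    order = list(range(n))
    rng.shuffle(order)
    n_learning = (2 * n) // 3          # assumed rounding: floor
    n_build = (2 * n_learning) // 3
    return order[:n_build], order[n_build:n_learning], order[n_learning:]


build, prune, test = split_indices(435, random.Random(0))  # 435 = Card(LS) for VO
print(BUILD, PRUNE, TEST, len(build), len(prune), len(test))
```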

On all but two problems, ICDL achieves better accuracies than CN2.
Results are even more favourable if we take into account the sizes
obtained. On all datasets, the DLs found by ICDL are much smaller than
those of CN2. If we exclude WO, sizes obtained for ICDL are up to
fourteen times smaller than CN2’s.

Comparisons with the state-of-the-art decision tree learning algorithm
C4.5 are presented in Table 5. Decision tree size is the number of nodes
including leaves [CB91]. To make the comparison fair, the size of a DL is
now the total number of literals plus the total number of classes in the
DL. With the exception of PC and GL, ICDL outperforms C4.5 on all
datasets. Again, the size comparison points out important differences,
which are on average in favor of ICDL. However, the gap is smaller than
for CN2, and on three problems the formulae found by C4.5 are actually
smaller than those of ICDL.

Table 5: ICDL vs C4.5.

       ICDL acc    C4.5 acc    ICDL size    C4.5 size
VO     95.9±1.0    95.6±1.1    6.9±2.7      7.7±3.4
M1     83.3        80.6        10           -
M2     65.1        64.8        23           -
M3     100         97.2        8            -
PC     70.7±1.5    74.3±3.1    22.6±13.2    90.2±10.2
GL     58.9±6.4    64.2±5.1    18.9±5.2     30.9±5.8
HH     79.4±3.4    78.0±5.5    21.7±10.0    7.2±3.7
HC     79.0±4.2    76.4±4.5    24.0±9.9     22.7±4.6
EC     70.7±4.7    63.6±5.3    9.1±3.7      9.2±4.7
HP     79.6±3.5    79.3±5.8    7.5±3.7      6.4±2.6
Avg.   78.26       77.4        15.81*       24.9*

Results for C4.5 are those of [CB91] (VO, PC, GL, HH, HC, EC, HP) and
[Koh95] (M1, M2, M3).

(*) Average sizes do not take into account M1, M2, M3 (we did not have
C4.5’s).

6  Discussion 

While the results above illustrate ICDL’s good performance, we now
discuss how close this algorithm comes to the four widely accepted
requirements presented in the introduction: high accuracy, noise
tolerance, small sizes, and low time complexity. As pointed out by
[CN89], time complexity needs only to be evaluated on the crucial steps
of the algorithm. During the construction of $dl_{max}$, ICDL’s crucial
step is the same as CN2’s, namely the addition of a single literal to the
current rule. So this complexity is that of the BestLiteral() subroutine
(BestL() in Table 1). It is
$O(\mathrm{Card}(LS) \times \mathrm{Card}(Atests))$ in ICDL, where
“Atests” denotes the set of possible literals. This complexity is smaller
than that of CN2 [CN89]. The time complexity of the crucial step of the
pruning phase corresponds to the complexity of the LiteralToPrune()
subroutine (LitToPrune() in Table 2), which determines which literal is
to be removed from the current formula. Its time complexity is
$O(\mathrm{Card}(LS) \times \mathrm{Card}(AtestsFormula))$, where
“AtestsFormula” denotes the multiset of literals of the current formula.
Overall, ICDL’s complexity is not high w.r.t. classical induction
algorithms.

M2 and VO are relevant to the discussion of sizes. In VO, empirical
studies (see the ML repository) show that there exists a single-literal
DL that achieves around 95% accuracy. We noticed that the DL found by
ICDL always included this formula, which leads to excellent tradeoffs
between size and accuracy. On the artificial domain M2, the function to
learn is an XOR function [TBB+91], which is very difficult to encode with
a small DL (compare with the result obtained by CN2). Again, in this
case, ICDL found a tiny formula which is highly accurate considering its
size and the difficulty of the domain.

ICDL  obtained  very  good  results  w.r.t.  the  complex¬ 
ity  vs  accuracy  tradeoff.  As  a  means  of  evaluating 
this  for  each  possible  dataset,  we  have  calculated  the 
accuracy/size  ratio,  which  constitutes  a  rough  “infor¬ 
mative  measure”  of  each  literal  with  respect  to  the 
overall  accuracy.  Provided  accuracies  are  sufficiently 
high  (which  was  the  case  for  CN2,  C4.5  and  ICDL),  the 
higher  this  ratio,  the  better  and  the  more  interesting 
the  accuracy/size  compromise  the  algorithm  obtains. 
The  calculation  shows  that,  for  each  dataset,  this  ratio 
is  higher  for  ICDL  than  for  CN2.  Furthermore,  with 
the exception of M1 and WO, ICDL’s lowest ratio is
higher  than  CN2’s  highest.  Finally,  if  we  exclude  the 
HH  and  HC  problems,  ICDL  also  outperforms  C4.5 
on  all  domains  for  which  we  possess  accuracy  and  size 
measures  for  C4.5.  Tables  4,  5  and  [CB91]  show  that 
for  datasets  VO,  PC,  GL,  EC,  HP,  ICDL  outperforms 
C4.5,  which  in  turn  outperforms  CN2.  ICDL’s  accu¬ 
racy  is  on  average  slightly  better  than  C4.5’s,  yet  its 
average  output  size  is  smaller.  Thus,  ICDL  appears 
to  be  able  to  compact  the  knowledge  of  the  decision 
trees  in  small  DLs.  This  result  concords  with  [Riv87], 
section  3.2,  who  shows  that  decision  lists  generalize 
decision  trees. 
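The accuracy/size ratios discussed above can be recomputed from the mean values reported in Table 4, as in the sketch below. Only the means are used (standard deviations are ignored), and the variable names are ours.

```python
# Mean values from Table 4: (ICDL acc, CN2 acc, ICDL size, CN2 size).
TABLE4 = {
    "VO": (95.9, 94.8, 3.1, 41.6),
    "WO": (66.5, 65.6, 21.8, 28.6),
    "M1": (83.3, 100.0, 6.0, 13.0),
    "M2": (65.1, 69.0, 14.0, 145.0),
    "M3": (100.0, 89.1, 4.0, 38.0),
    "PC": (70.7, 70.6, 14.8, 133.6),
    "GL": (58.9, 58.5, 12.6, 32.8),
    "HH": (79.4, 75.4, 13.4, 35.1),
    "HC": (79.0, 75.0, 15.5, 40.9),
    "EC": (70.7, 62.3, 4.7, 26.4),
    "HP": (79.6, 77.6, 4.2, 24.0),
}

# Accuracy/size ratio per dataset for each algorithm.
icdl_ratio = {d: acc / size for d, (acc, _, size, _) in TABLE4.items()}
cn2_ratio = {d: acc / size for d, (_, acc, _, size) in TABLE4.items()}

# Claim 1: the ratio is higher for ICDL on every dataset.
icdl_always_higher = all(icdl_ratio[d] > cn2_ratio[d] for d in TABLE4)

# Claim 2: excluding M1 and WO, ICDL's lowest ratio beats CN2's highest.
kept = [d for d in TABLE4 if d not in ("M1", "WO")]
separated = min(icdl_ratio[d] for d in kept) > max(cn2_ratio[d] for d in kept)

for d in TABLE4:
    print(f"{d}: ICDL {icdl_ratio[d]:.2f}  CN2 {cn2_ratio[d]:.2f}")
print(icdl_always_higher, separated)
```

Both statements hold on the reported means; without the exclusion of M1 and WO the second one would fail, since CN2's ratio on M1 exceeds ICDL's ratio on WO.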

Finally,  ICDL’s  handling  of  noise  can  be  experimen¬ 
tally  evaluated  using  problems  WO  and  M3,  which  are 
artificial  noisy  problems.  On  both  problems,  ICDL’s 
results  are  good.  In  M3,  there  is  a  little  noise,  and  very 
few  learning  algorithms  in  [TBB+91]  achieve  100%  ac¬ 
curacy.  ICDL  achieves  the  perfect  classification,  and 
surpasses  all  inductive  learning  algorithms  tested  in 
[TBB+91]  :  ID3,  ID5R,  AQR,  CN2  and  CLASSWEB. 
Again,  the  size  of  the  formula  found  by  ICDL  is  smaller 
than  that  of  these  algorithms. 

[SE94]  point  out  a  limiting  aspect  of  decision  list  con¬ 
struction  using  greedy  algorithms  such  as  CN2  :  rules 
cannot  be  considered  in  isolation,  and,  after  each  rule 
building  stage,  fewer  examples  are  available  to  the 
learning  algorithm.  ICDL  reduces  the  adverse  effects 
of  both  problems  by  building  small  decision  lists  using 
efficient pruning. Indeed, the pruning step uses new examples, which are
not used for building $dl_{max}$. Furthermore, it uses a criterion
reducing the effect of the limited number of examples available to the
rules at the end of the decision list.

7  Conclusion 

In  this  paper,  we  introduce  ICDL,  a  new  algorithm 
for  learning  simple  decision  lists.  Its  originality  stems 
from  the  adaptation  to  decision  lists  of  the  combina¬ 
tion  of  two  techniques  which  have  proved  very  effi¬ 
cient  in  CART  and  C4.5.  It  combines  the  building 
of  a  decision  list  in  a  greedy  way  using  a  Gini  cri¬ 
terion  calculated  on  the  whole  decision  list.  It  then 
prunes  the  decision  list  using  a  CART-like  criterion. 
However,  its  result  formulae  are  rule-based  procedures 
which  provide  an  alternative  to  decision  trees  for  build¬ 
ing  knowledge-based  systems. 

We  prove  formally  that  inducing  short  and  accurate 
DLs  is  intractable,  which  prescribes  the  use  of  heuris¬ 
tics  such  as  CN2,  ICDL,  or  BruteDL  [SE94].  We 
then  give  experimental  results  which  compare  very 
favourably  to  those  of  CN2  and  C4.5,  and  which  show 
experimentally  that  ICDL  meets  the  accepted  criteria 
of low time complexity, noise handling, small output
size  and  high  accuracy.  We  believe  this  efficiency  is 
due  to  ICDL’s  successful  application  of  a  combination 
of decision tree learning techniques to the more expressive DL
representation.

Abstract

It  is  well  known  that  for  Markov  decision  pro¬ 
cesses,  the  policies  stable  under  policy  iteration 
and  the  standard  reinforcement  learning  methods 
are  exactly  the  optimal  policies.  In  this  paper, 
we  investigate  the  conditions  for  policy  stability 
in  the  more  general  situation  when  the  Markov 
property  cannot  be  assumed.  We  show  that  for  a 
general  class  of  non-Markov  decision  processes, 
if  actual  return  (Monte  Carlo)  credit  assignment 
is  used  with  undiscounted  returns,  we  are  still 
guaranteed  the  optimal  observation-based  poli¬ 
cies  will  be  equilibrium  points  in  the  policy  space 
when  using  the  standard  “direct”  reinforcement 
learning  approaches.  However,  if  either  dis¬ 
counted  rewards,  or  a  temporal  differences  style 
of  credit  assignment  method  is  used,  this  is  not 
the  case. 


1  Introduction 

The techniques of reinforcement learning (RL) have been developed to
effect autonomous learning in agents interacting with an initially
unknown and possibly changing environment. In its simplest formulation,
the problem of RL is cast into a table lookup representation, where the
agent can be in one of a finite number of states at any time, and has the
choice of a finite number of actions to take from within each state. For
this representation, powerful convergence and optimality results have
been proven for a number of algorithms designed with the simplifying
assumption that the environment is Markov, e.g. 1-step Q-learning
(Watkins, 1989). With this assumption, the problem of learning can be
cast into the form of finding an optimal policy for a Markov decision
process (MDP), and methods like 1-step Q-learning can be shown to be a
form of incremental, asynchronous dynamic programming (Watkins, 1989;
Barto, Bradtke, & Singh, 1995).

* Current address: Daimler-Benz Research and Technology Center, 1510 Page
Mill Rd, Palo Alto, CA 94304, USA. e-mail: pendrith@rtna.daimlerbenz.com

In  practice,  however,  RL  techniques  are  routinely  applied 
to  many  problem  domains  for  which  the  Markov  property 
does  not  hold.  This  might  be  because  the  environment  is 
non-stationary,  or  is  only  partially  observable;  often  the 
side-effects  of  state-space  representation  can  lead  to  the  do¬ 
main  appearing  as  non-Markov  to  a  reinforcement  learning 
agent. 

In  this  paper,  we  examine  various  issues  arising  from  ap¬ 
plying  standard  RL  algorithms  to  non-Markov  decision 
processes  (NMDPs).  In  particular,  we  are  interested  in 
the  implications  of  using  a  “direct”  or  observation-based 
method  of  RL  for  a  non-Markov  problem,  i.e.  where  the 
problem  is  known  to  be  non-Markov  but  partial  or  noisy 
state  observations  are  presented  directly  to  the  RL  algo¬ 
rithm  without  any  attempt  to  identify  a  “true”  Markov  state 
(Singh,  Jaakkola,  &  Jordan,  1994). 

2  Policy  stability  in  Dynamic  Programming 

In  this  section,  we  review  the  important  idea  of  a  stable  pol¬ 
icy  in  terms  of  classical  dynamic  programming  (DP)  meth¬ 
ods. 

It is well known (e.g. Puterman (1994)) that for any MDP, all suboptimal
policies are unstable under policy iteration, i.e. one step of the policy
iteration process will result in a different policy. Moreover, the new
policy will be a better policy; and so the process of policy iteration
can be viewed as a hill-climbing process through the space of stationary
policies, i.e. each step of policy iteration yields a monotonic
improvement in policy until an optimal policy is reached.

Any optimal policy will have the property of being stable under a single
step of policy iteration. The special properties of a Markov domain
ensure that the policy space is well-suited to a hill-climbing strategy;
there are no “local maxima” or suboptimal equilibrium points to contend
with, and all the global maxima form a single connected “maxima plateau”
that can be reached by starting a hill-climbing process from any point in
the space.

It  is  also  the  case  that  a  “partial”  policy  iteration,  where 
only  a  subset  of  the  states  that  would  have  policy  changes 
under  a  full  policy  iteration  step  have  their  policy  actions 
changed,  will  also  monotonically  improve  the  policy,  and 
therefore  result  in  effective  hill-climbing.  This  is  the  key 
property  that  makes  MDPs  susceptible  to  RL  techniques;  it 
has  become  the  convention  to  characterise  RL  in  Markov 
domains  as  an  incremental,  asynchronous  form  of  dynamic 
programming (Watkins, 1989; Barto et al., 1995). If the RL
method  is  a  1-step  temporal  differences  (TD)  method,  like 
Watkins’  1-step  Q-learning,  the  method  resembles  an  in¬ 
cremental,  asynchronous  form  of  value-iteration.  If  the  RL 
method  is  an  actual  return  or  Monte  Carlo  based  method, 
like  C-Trace  (Pendrith  &  Ryan,  1996)  the  method  more 
closely  resembles  an  incremental,  asynchronous  form  of 
policy  iteration. 

So,  for  an  MDP,  the  optimal  policies  correspond  to  the  pol¬ 
icy  iteration  equilibrium  points  in  the  policy  space.  By  way 
of  contrast,  for  NMDPs  it  is  straightforward  to  demonstrate 
that  suboptimal  policy  iteration  equilibria  are  possible,  and 
subsequently  that  policy  iteration  methods  can  fail  by  get¬ 
ting  “stuck”  in  local  maxima.  Consider  the  NMDP  in  Fig¬ 
ure  1. 

Figure 1 shows an NMDP with two actions available from starting
observation A, and two actions available from the successor observation
B.* Both action 0 and action 1 from observation A immediately lead to
observation B with no immediate reward. Action 0 and action 1 from
observation B both immediately lead to termination and a reward; the
decision process is non-Markovian because the reward depends on what
action was previously selected from observation A, according to the
schedule in Table 1.

In the policy space for this NMDP, the deterministic policy
$\pi_3$ is clearly optimal, with a total reward of 2. Further,
it represents an equilibrium under policy iteration; if
states A or B independently change policy, the total reward
becomes -2 and 0 respectively. Notice that policy $\pi_0$,
although clearly suboptimal with a total reward of 1, is also
an equilibrium; if states A or B independently change policy,
the total reward becomes 0 and -2 respectively.

Figure 1: An NMDP demonstrating a suboptimal equilibrium.

Policy    A action   B action   reward
$\pi_0$   0          0           1
$\pi_1$   0          1          -2
$\pi_2$   1          0           0
$\pi_3$   1          1           2

Table 1: Reward schedule for NMDP in Figure 1.

¹In general, we will be referring to the “observations” rather
than “states” of an NMDP, as we will be moving on later to discuss
a specific class of NMDPs that are defined in a POMDP framework,
where this terminological distinction becomes important.
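The equilibrium claims above can be checked mechanically. The following is a minimal sketch, assuming only the reward schedule of Table 1; the names `REWARD` and `is_equilibrium` are illustrative and do not come from the paper. A policy is treated as an equilibrium when neither observation can raise the total reward by changing its own action while the other keeps its action fixed.

```python
from itertools import product

# Reward schedule from Table 1: (action at A, action at B) -> terminal reward.
REWARD = {(0, 0): 1, (0, 1): -2, (1, 0): 0, (1, 1): 2}


def is_equilibrium(policy):
    """True if no single observation gains by switching its action alone."""
    a, b = policy
    current = REWARD[(a, b)]
    if_a_switches = REWARD[(1 - a, b)]  # A changes, B keeps its action
    if_b_switches = REWARD[(a, 1 - b)]  # B changes, A keeps its action
    return current >= if_a_switches and current >= if_b_switches


# Enumerate the four deterministic policies in the order pi_0 .. pi_3.
equilibria = [p for p in product((0, 1), repeat=2) if is_equilibrium(p)]
optimal = max(REWARD, key=REWARD.get)

print(equilibria)  # policies that are stable under single-observation changes
print(optimal)     # the policy with the highest total reward
```

Under these assumptions the stable policies are (0, 0) and (1, 1), while only (1, 1) is optimal, which is the suboptimal-equilibrium situation the example is meant to exhibit.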

Although we have only explicitly considered deterministic
policies in the above discussion, we note that the result
generalises straightforwardly to stochastic policies.

In the case of the example above the optimal policy was
also a deterministic policy. However, it is known that in
general for NMDPs there may be no deterministic policy
among the optimal policies, whereas for MDPs there is
always at least one (Singh et al., 1994).

Further,  we  will  show  in  this  paper  that  if  a  TD  method  of 
credit  assignment  is  used,  or  the  rewards  are  discounted,  the 
optimal  policies  may  not  represent  equilibrium  points  in  the 
policy  space,  even  if  an  optimal  deterministic  policy  exists. 
This  means  that  even  if  the  problems  of  local  maxima  are 
overcome,  the  optimal  policies  may  not  be  attractive  under 
some  standard  RL  techniques. 

It  turns  out  the  key  property  of  optimal  policies  being  stable 
under  RL  is  only  preserved  if  the  additional  restrictions  of 
using undiscounted rewards and using actual return credit
assignment  methods  are  imposed. 

3  Learning  Equilibria 

For  the  analysis  of  standard  RL  algorithms  for  NMDPs,  it 
is  useful  for  us  to  introduce  the  notion  of  a  learning  equi¬ 
librium,  an  equilibrium  in  policy  resulting  from  a  particular 
learning  method.  So  just  as  we  can  talk  about  a  policy  that 
is  stable  under  policy  iteration,  we  might  talk  about  a  policy 
that  is  stable  under  1-step  Q-learning,  for  example. 

Definition 1 A learning equilibrium has the property that
if you replace the current state (or (state, action)) value
estimates with the expected value of those estimates given
the current policy and the learning method being used, then
the policy remains unchanged.

A  learning  equilibrium  can  be  considered  to  be  a  stochas¬ 
tic  fixed  point  in  the  policy  space  with  respect  to  a  given 
learning  method. 

We  consider  that,  in  general,  an  RL  system  will  in  the 
course  of  learning  perform  a  series  of  updates  to  a  set  of 
real-valued utility estimators. These estimators will typically
estimate state value or (state, action) value, or sometimes
both. Further, we are assuming the current policy of
an RL system will be a function of these estimator values,
which we might write as $f : E \to \Pi$, where $E$ represents
the space of possible estimator values, and $\Pi$ is the policy
space.

We can also consider a mapping in the reverse direction
$g : \Pi \to E$, where the point $g(\pi)$ in the estimator space
corresponds to the expected values of the estimators with
respect to the learning rules and a policy $\pi$. For example,
if the system is a Q-learning system, $g : \Pi \to E$ would be
defined by the Q function, where $Q^\pi(s,a)$ is the expected
value of the $(s,a)$ estimator under policy $\pi$.

If we consider $h : \Pi \to \Pi$ to be the composition of functions
$f$ and $g$ such that $h(\pi) = f(g(\pi))$, then if a policy
$\pi'$ meets the fixed point condition $\pi' = h(\pi')$, then $\pi'$ is a
learning equilibrium. In this way a learning equilibrium
can be considered a generalisation of the notion of a policy
that is stable under policy iteration. However, given
the stochastic nature of the $g$ mapping, such a fixed point
represents stability in terms of expectation only.
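The fixed point condition can be made concrete on the NMDP of Figure 1. The sketch below is an illustration under stated assumptions, not the authors' implementation: `g` returns the undiscounted actual-return value of each (observation, action) pair when the other observation follows the current policy, `f` is taken to be the greedy policy, and all function names are hypothetical.

```python
# Reward schedule from Table 1: (action at A, action at B) -> terminal reward.
REWARD = {(0, 0): 1, (0, 1): -2, (1, 0): 0, (1, 1): 2}
OBSERVATIONS = ("A", "B")
ACTIONS = (0, 1)


def g(policy):
    """Expected estimator values under `policy` (exact, as all is deterministic)."""
    a, b = policy
    return {
        ("A", 0): REWARD[(0, b)],
        ("A", 1): REWARD[(1, b)],
        ("B", 0): REWARD[(a, 0)],
        ("B", 1): REWARD[(a, 1)],
    }


def f(estimates):
    """Greedy policy with respect to the estimator values."""
    return tuple(
        max(ACTIONS, key=lambda act: estimates[(obs, act)])
        for obs in OBSERVATIONS
    )


def h(policy):
    """Composition h = f o g; a learning equilibrium satisfies h(policy) == policy."""
    return f(g(policy))


all_policies = [(0, 0), (0, 1), (1, 0), (1, 1)]
fixed_points = [p for p in all_policies if h(p) == p]
print(fixed_points)
```

With these choices the fixed points are (0, 0) and (1, 1); the other two policies map onto each other under `h` rather than onto themselves.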

For  any  MDP  with  a  total  discounted  reward  optimality  cri¬ 
terion,  the  only  equilibrium  policies  for  any  of  the  RL  or 
DP  methods  discussed  so  far  will  be  optimal  policies.  A 
policy  that  is  stable  under  policy  iteration  is  also  stable  un¬ 
der  value  iteration,  or  under  1-step  Q-learning. 

On  the  other  hand,  in  a  non-Markov  setting  there  may  be 
suboptimal equilibria for RL systems. Figure 1 provides
an example of this possibility.

Clearly,  having  a  global  maximum  in  policy  space  which 
is  also  a  learning  equilibrium  is  a  necessary  condition  for 
convergence  to  an  optimal  policy  under  a  given  learning 
method.  This  basic  idea  provides  the  motivation  for  the 
form  of  analysis  that  follows. 

4  hPOMDPs 

The  essence  of  an  NMDP  is  that  the  history  of  states  and 
actions leading to the present state may in some way influence
the expected outcome of taking an action within
that state. When applying a standard RL method like 1-step
Q-learning to an NMDP, the history is not used even if
available  —  this  is  what  Singh  et  al.  (1994)  call  direct  RL 
for  NMDPs.  Therefore,  one  potentially  useful  approach 
to  modelling  a  general  class  of  NMDPs  is  by  considering  a 
process  that  becomes  Markov  when  the  full  history  of  states 
and  actions  leading  to  the  present  state  is  known,  but  may 
be  only  partially  observable  if  this  history  is  not  available 
or  only  partially  available,  i.e.  the  full  history  is  guaranteed 
to  provide  any  missing  state  information. 

Another  way  of  expressing  this  is  to  say  that  nothing  apart 
from  the  currently  observed  state  information  along  with 
the  history  is  required  to  provide  a  sufficient  statistic.  This 
property  defines  a  class  of  partially  observable  Markov  de¬ 
cision  process  (POMDP)  we  will  call  hPOMDPs  (with  h 
for  history). 

We  should  emphasise  that  in  the  hPOMDP  model  the  full 
history  is  always  sufficient,  but  not  always  necessary,  to 
disambiguate the true state. hPOMDPs include processes
where  only  some  or  even  none  of  the  history  is  required. 
For  example,  a  fully  Markov  process,  which  requires  no 
history  at  all  to  disambiguate  the  state,  is  included  in  the 
hPOMDP  class.  Another  example  would  be  a  process 
that only requires the current observation plus the starting
observation for full state disambiguation. Using a
POMDP formulation, we can formalise the properties of an
hPOMDP stated above by requiring the existence of a function
of the current observation $s$ and history $h$ that maps
the pair $(s,h)$ into a unique state in the underlying MDP.

The  original  motivation  behind  the  formalisation  of 
hPOMDPs  was  to  provide  a  model  for  the  sort  of  non- 
Markovianness  that  is  encountered  when  state  aggregation 
due  to  state-space  representation  or  other  forms  of  state¬ 
aliasing  occur;  usually,  in  cases  like  these,  history  can  make 
the  observation  less  ambiguous  to  some  extent,  and  the 
more history you have the more precisely you can determine
the true state.² However, hPOMDPs may also be used
to model the more discrete kinds of perceptual-aliasing
more frequently encountered in the RL literature, a prototypical
example being Kaelbling et al.’s “robot in the corridors”
scenario (Kaelbling, Littman, & Cassandra, 1995).³

²We note that for some control processes, even the entire history
is not able to completely disambiguate the state. For example,
the original noise- and disturbance-free formulation of the
pole-and-cart problem using a “boxed” state-space representation
(Michie & Chambers, 1968; Barto, Sutton, & Anderson, 1983;
Pendrith & Ryan, 1996) is well-modelled using hPOMDPs when
the initial state of the system is known (e.g. zero for all state variables),
but if the initial state variables are randomised or otherwise
uncertain, then access even to the full history may not make the
true current state unambiguous. However, even in this situation,
we note the history will make the true state less ambiguous; and so
the hPOMDP model might be considered to be a useful “limiting
case” approximation for domains like these.

5  A  Discounted  Reward  Framework  for 
NMDPs 

Because we are interested in what happens when applying
standard discounted reward RL methods like Q-learning
to NMDPs, we restrict our attention to the class of finite
hPOMDPs (i.e., an hPOMDP such that the observation/action
space $S \times A$ is finite).⁴ This effectively models
the RL table-lookup representation for which all the
strong convergence results have been proven in the context
of MDPs.

5.1  Summing  Over  Histories 

We  consider  a  total  path  or  trace  through  a  finite  hPOMDP 
which  can  be  written  as  a  sequence  of  observation/action 
pairs 

$$((s_0,a_0), (s_1,a_1), \ldots, (s_i,a_i), \ldots)$$

where $(s_i,a_i)$ is the pair associated with the $i$-th time-step
of this path through the system. For any finite or infinite
horizon total path $\omega$ there is an associated total discounted
reward

$$R(\omega) = \sum_{i=0}^{n} \gamma^i r_i \qquad (1)$$

where $\gamma \in [0,1]$ is the discount factor, $r_i$ is the immediate
reward associated with taking action $a_i$ from observation
$s_i$, and $n$ is the horizon.
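For a finite trace, the total discounted reward of Equation (1) is a short computation. The sketch below is illustrative only; the function name `discounted_return` and the example reward sequence are assumptions, not part of the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite trace (Equation (1))."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's rule), so each
    # earlier step adds its reward and discounts everything that follows.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A three-step trace whose only reward arrives on the final step.
example = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
print(example)
```

Setting `gamma=1.0` gives the undiscounted total, which is the case that matters for the positive result later in the paper.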

In measure theoretic terms, we can express the probability
$P_s^\pi$ of a particular observation $s$ ever being visited under
policy $\pi$ as

$$P_s^\pi = P^\pi(T_s) \qquad (2)$$

where the set $T_s$ is the set of possible traces that includes
$s$, and $P^\pi$ is a suitably defined probability measure over the
space of all possible traces $T$ with respect to policy $\pi$. We
can also write

$$P_{\bar{s}}^\pi = (1 - P_s^\pi) = P^\pi(T_{\bar{s}})$$

where $P_{\bar{s}}^\pi$ is the complementary probability of observation
$s$ not being visited, $T_{\bar{s}}$ being the set of traces that do not
include $s$. These “visit probabilities” assume there is a
distribution of starting observations $\psi$ associated with an
hPOMDP, where $\psi_s$ is the a priori probability of observation
$s$ being the initial observation of the process.

³We note however that for such systems to be accurately modelled
by hPOMDPs some additional restrictions on the problem
may need to be applied, e.g. the initial state must be known to the
RL agent.

⁴Note that this does not imply there are only a finite number
of states in the underlying MDP (cf. Singh et al., 1994).

We note that in general, e.g. if the process is non-absorbing,
a trace may be of infinite length, and therefore the associated
probability of it occurring may be infinitesimal, and
the set $T_s$ uncountable; these considerations motivate introducing
the techniques of measure theory.⁵

We  also  note  that  executing  a  trace  that  involves  one  or 
more  visits  to  s  is  logically  equivalent  to  executing  a  trace 
that  involves  a  first  visit  to  s,  and  therefore 

$$P_s^\pi = \sum_{h \in H_s} p(h,\pi) \qquad (3)$$

where $H_s$ is the set of finite length first-visit histories, which
are the possible chains of observation/action pairs leading
to a first visit to observation $s$, and $p(h,\pi)$ is the associated
probability of a first visit occurring by that history under
policy $\pi$. Because the $h \in H_s$ are of finite length, $p(h,\pi)$ is
finite and $H_s$ is countable, and therefore we can express the
value as a sum rather than an integral.

The  technical  issue  of  defining  an  appropriate  probability 
measure  consistent  with  the  value  of  this  sum  to  enable 
working  with  Lebesgue  integrals  is  dealt  with  in  detail  in 
(Pendrith  &  McGarity,  1997),  where  the  equivalence  of  (2) 
and  (3)  is  used  as  a  starting  point.  However,  it  is  not  nec¬ 
essary  to  immediately  consider  these  details  to  follow  the 
development  of  this  paper. 

5.2  Defining  Analogs  of  Q-value  and  State  Value  for 
hPOMDPs 

A  stochastic  policy  takes  the  form  of  a  set  of  action  se¬ 
lection  distributions,  with  one  distribution  for  each  obser¬ 
vation.  Thus  a  deterministic  policy  can  be  considered  to 
be  a  special  case  of  a  stochastic  policy.  So  for  generality, 
we  define  the  following  hPOMDP  values  with  respect  to 
stochastic  policies. 

We consider the expected future discounted reward (i.e.
utility) of taking an action randomly selected with respect
to a distribution $d$ from an observation $s$, with first-visit history
$h$ and following policy $\pi$ thereafter. We denote this as
$U^\pi(s,d,h)$. For notational convenience, we will also write
$U^\pi(s,a,h)$ to represent the utility of taking a particular action
$a$ from observation $s$ with history $h$ and following policy
$\pi$ thereafter. (This can be considered shorthand where
$a$ stands in for the distribution that would deterministically
select $a$.)

⁵For a review of the essential measure theory concepts used in
this paper see e.g. (Billingsley, 1986).

We note that the values $U^\pi(s,d,h)$ and $U^\pi(s,a,h)$ are both
well-defined by the definition of an hPOMDP. $U^\pi(s,a,h)$
can be considered the “Q-value” of the underlying (possibly
infinite state) MDP where the action $a$ is taken from
the “true” state determined by $s$ and $h$. $U^\pi(s,d,h)$ would
therefore be a weighted average of these Q-values for that state.


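A value that is of interest if we are considering what can
be learned by applying standard RL methods directly to
hPOMDPs is the following weighted average of the above
defined utilities

$$Q^\pi(s,d) = \begin{cases} \sum_{h \in H_s} \frac{p(h,\pi)}{P_s^\pi}\, U^\pi(s,d,h) & \text{if } P_s^\pi > 0 \\ \text{undefined} & \text{if } P_s^\pi = 0 \end{cases}$$

Extending our shorthand notation introduced above, we
will also write

$$Q^\pi(s,a) = \begin{cases} \sum_{h \in H_s} \frac{p(h,\pi)}{P_s^\pi}\, U^\pi(s,a,h) & \text{if } P_s^\pi > 0 \\ \text{undefined} & \text{if } P_s^\pi = 0 \end{cases}$$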


$Q^\pi(s,a)$ is what might be called the “observation first-visit
Q-value”; we observe it is the value a first-visit Monte
Carlo method will associate with taking action $a$ from observation
$s$ in the hPOMDP.⁶ Similarly, $Q^\pi(s,d)$ is the expected
value a first-visit Monte Carlo method will come to
associate with selecting an action using distribution $d$ from
observation $s$.

Using the definitions above, we define the value of an observation
for a policy to be $V^\pi(s) = Q^\pi(s,\pi_s)$, where $\pi_s$ is
the action selection distribution associated with observation
$s$ under stochastic policy $\pi$; if $\pi$ represents a deterministic
policy, then $\pi_s$ denotes the policy action for observation $s$.

We note that the values of $Q^\pi(s,d)$, $Q^\pi(s,a)$ and hence
$V^\pi(s)$ are undefined for $s$ if $P_s^\pi = 0$ (i.e., $s$ is unreachable
under $\pi$). This is because, unlike the case for MDPs, it is
difficult  to  assign  a  sensible  meaning  to  the  notion  of  the 
value  of  taking  an  action  from  an  unreachable  observation. 
In  short,  the  notion  of  an  “observation  first-visit  Q-value” 
is  fairly  empty  if  a  first  visit  simply  isn’t  possible. 


⁶Recall that $H_s$ is a set of first-visit histories. We consider
first-visit rather than multiple-visit Monte Carlo methods because
there are some basic conceptual problems with using the latter in a
non-Markovian context (in the general case, it doesn’t make sense
to apply multiple-visit Monte Carlo when histories may matter,
i.e.  the  Markov  assumption  doesn’t  hold.)  For  an  introduction  to 
concepts  of  first-visit  versus  multiple-visit  Monte  Carlo  methods 
as  applied  to  RL,  see  (Singh  &  Sutton,  1996). 


5.3  Policy  Values  for  hPOMDPs 

We can write the policy value, or total expectation, of an
hPOMDP in terms of a Lebesgue integral

$$J(\pi) = \int_{\omega \in T} R(\omega)\, dP^\pi(\omega) \qquad (4)$$

integrating over total paths.

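We can further decompose the total expectation into a conditional
expectation component that involves observation $s$
and another that is independent of change to the policy for
observation $s$ in the following expression:

$$J(\pi) = \int_{\omega \in T_s} R(\omega)\, dP^\pi(\omega) + \int_{\omega \in T_{\bar{s}}} R(\omega)\, dP^\pi(\omega) \qquad (5)$$
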

Note that for a general discounted reward structure we can
write

$$\int_{\omega \in T_s} R(\omega)\, dP^\pi(\omega) = \sum_{h \in H_s} p(h,\pi)\,[R(h) + \gamma^{l_h}\, U^\pi(s,\pi_s,h)] \qquad (6)$$

where $0 \le \gamma \le 1$ is the discount factor, $l_h$ is the length of
history $h$, and $R(h)$ is the value of the truncated discounted
return associated with history $h$ (cf. Equation (1)). Thus,
the LHS and RHS of this identity are different expressions
for the conditional expectation assuming a visit to observation $s$.


Finally, we define an optimal observation-based policy $\pi^*$
simply by

$$\pi^* \in \arg\max_{\pi} J(\pi) \qquad (7)$$


These  definitions  provide  a  framework  for  analysing 
hPOMDPs  using  a  total  future  discounted  reward  criterion 
which  applies  equally  well  to  both  ergodic  and  non-ergodic 
systems. 


6  Analysis  of  Observation-Based  Policy 
Learning  Methods  for  hPOMDPs 

The  first  result  we  present  is  a  lemma  useful  in  the  proof  of 
Theorem  1.  The  proof  of  the  lemma  is  omitted  for  space 
reasons;  for  the  proof  see  (Pendrith  &  McGarity,  1997). 

Lemma 1 If two observation-based policies $\pi$ and $\hat\pi$ for
an undiscounted hPOMDP differ only in one observation
$s$, then the difference in values between the policies $\pi$ and
$\hat\pi$ can be expressed as

$$J(\hat\pi) - J(\pi) = P_s^\pi\,[V^{\hat\pi}(s) - V^\pi(s)] \qquad (8)$$

We note Lemma 1 has a strong intuitive basis, suggesting
its applicability to a very general class of decision processes
including but not limited to hPOMDPs. Equation (8)
corresponds to the straightforward observation that for an
undiscounted reward process, by changing policy in exactly
one reachable state under policy $\pi$, the change in value of
the expected total reward for the new policy is equal to the
change in first-visit expected value for the changed state
multiplied by the a priori probability that state will have a
first-visit under policy $\pi$.
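As a quick sanity check of Equation (8), consider the NMDP of Figure 1, treated here as an undiscounted hPOMDP (an assumption, on the grounds that the action taken from A is part of the history at B). Writing $\hat\pi$ for the changed policy, take $\pi = \pi_3$ (action 1 from both observations) and $\hat\pi = \pi_1$, which differs only in observation A. Observation A is the starting observation, so its visit probability is 1, and from Table 1 the first-visit values are $V^{\pi_3}(A) = 2$ and $V^{\pi_1}(A) = -2$. Both sides of the identity then agree:

```latex
\begin{align*}
J(\pi_1) - J(\pi_3) &= -2 - 2 = -4, \\
P_A^{\pi_3}\,\bigl[V^{\pi_1}(A) - V^{\pi_3}(A)\bigr] &= 1 \cdot \bigl[(-2) - 2\bigr] = -4.
\end{align*}
```

The same check with $\hat\pi = \pi_2$, which differs from $\pi_3$ only in observation B, gives $0 - 2 = -2$ on both sides.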

Theorem 1 If a first-visit Monte Carlo method of credit assignment
is used for an hPOMDP where $\gamma = 1$, then the
optimal observation-based policies will be learning equilibria.

Proof Suppose an optimal observation-based policy $\pi$ is
not a learning equilibrium under a first-visit Monte Carlo
credit assignment method; then there must exist an observation
$s$ such that $V^{\hat\pi}(s) > V^\pi(s)$ for some policy $\hat\pi$ that
differs from $\pi$ only in observation $s$. By Lemma 1, the difference
in policy values is

$$J(\hat\pi) - J(\pi) = P_s^\pi\,[V^{\hat\pi}(s) - V^\pi(s)]$$

Since $V^{\hat\pi}(s) > V^\pi(s)$ and $P_s^\pi > 0$ (i.e. observation $s$ is
reachable under $\pi$),⁷ then $J(\hat\pi) > J(\pi)$. But this is not possible
since $\pi$ is an optimal policy; hence an optimal policy
is a learning equilibrium. □

Theorem  1  is  a  positive  result;  it  shows  that,  at  least  under 
certain  restricted  conditions,  an  optimal  observation-based 
policy  is  also  guaranteed  to  represent  a  policy  equilibrium 
for  a  direct  RL  style  learner. 

The  next  question  is  whether  we  can  generalise  the  result. 
Does the result hold for general $\gamma$? Does the result hold for
TD returns instead of Monte Carlo style “roll-outs”?

The next result addresses the issue of using discounted returns
for general $\gamma$.

Theorem 2 Theorem 1 does not generalise to $\gamma \in [0, 1)$.

Proof  We  prove  this  by  providing  a  counter-example.  We 
consider  the  hPOMDP  in  Figure  2. 

Figure 2 shows an hPOMDP with one action available from
the two equiprobable starting observations A and B; one
action available from intermediate observation C; and two
actions available from the penultimate observation D. An
action from observation A leads to observation C without
reward; actions from observations B and C lead to observation
D without reward. Both action 0 and action 1 from observation
D immediately lead to termination and a reward;
the decision process is non-Markovian because the reward
depends not only on the action taken from observation D,
but also on the starting observation.

Figure 2: The hPOMDP discussed in the proof of Theorem 2.

⁷Note that observation $s$ must be reachable under both $\pi$ and
$\hat\pi$, otherwise both $V^\pi(s)$ and $V^{\hat\pi}(s)$ would be undefined, which is
incompatible with the hypothesis $V^{\hat\pi}(s) > V^\pi(s)$.

We assume that $\gamma < 1$ for this discounted reward decision
process; suppose the reward schedule is as follows:

Start observation   action at D   reward
A                   0             $r_1$
A                   1             $r_2$
B                   0             $r_3$
B                   1             $r_4$

Let $\pi_0$ and $\pi_1$ be the policies that correspond to 0 and 1
being the policy action from D. We set $r_1, \ldots, r_4$ such that
$Q^\pi(D,0) > Q^\pi(D,1)$ for arbitrary $\pi$ (i.e. $(r_1 + r_3)/2 > (r_2 + r_4)/2$),
but also so that $J(\pi_0) < J(\pi_1)$ (i.e. $(\gamma r_1 + r_3)/2 < (\gamma r_2 + r_4)/2$).
For example, one assignment satisfying both inequalities is
$r_2 = 0$, $r_3 = 1$, $r_4 = 2$, with $r_1$ selected such that
$\gamma r_1 < 1 < r_1$.

In such a case, D will see action 0 as preferable, which
appears locally optimal even though the choice results in
suboptimal policy $\pi_0$. Thus the sole optimal policy $\pi_1$ does
not represent a learning equilibrium for this hPOMDP. □
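The counter-example can be checked numerically. The sketch below is an illustration only: the reward values and the discount factor are chosen here rather than taken from the paper, and it assumes rewards are discounted by their time-step index as in Equation (1), so a reward collected at D is discounted twice when starting in A and once when starting in B. The function name `figure2_values` is hypothetical.

```python
def figure2_values(r1, r2, r3, r4, gamma):
    """First-visit Q-values at D and policy values for the Figure 2 hPOMDP."""
    # First-visit Monte Carlo values at D average over the two equiprobable
    # starting observations; the return from D is just the terminal reward.
    q_d0 = (r1 + r3) / 2
    q_d1 = (r2 + r4) / 2
    # Start A: A -> C -> D -> end (reward at step 2).
    # Start B: B -> D -> end (reward at step 1).
    j_pi0 = (gamma**2 * r1 + gamma * r3) / 2
    j_pi1 = (gamma**2 * r2 + gamma * r4) / 2
    return q_d0, q_d1, j_pi0, j_pi1


# Illustrative values (assumptions): gamma * r1 < 1 < r1 holds for these.
q_d0, q_d1, j_pi0, j_pi1 = figure2_values(r1=1.5, r2=0.0, r3=1.0, r4=2.0, gamma=0.5)

print(q_d0 > q_d1)    # D prefers action 0 locally
print(j_pi0 < j_pi1)  # yet always choosing action 1 has the higher policy value
```

The policy-value comparison differs from the inequality in the proof only by a common positive factor of the discount, so the two conditions coincide whenever the discount factor is strictly positive.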

Next, we examine the case where TD style returns are used:

Theorem 3 If a TD($\lambda$) credit-assignment method is used
for direct RL of an NMDP, then for $\lambda < 1$ it is not guaranteed
that there exists an optimal observation-based policy representing
a learning equilibrium.

Proof Consider the hPOMDP in Figure 3. Observations A
and B are the equiprobable starting observations. We note
all the transitions are deterministic, and that in observation
A there are two actions to select from while observations
B and C have only one. Action 0 from observation C leads
directly to termination with an immediate reward; if the
starting observation is A, the immediate reward is 1, but if
the starting observation is B, the immediate reward will be
zero. Action 0 from observation A also has a termination
and a non-zero immediate reward associated with it, the
exact value of which we will discuss in a moment. All
other transitions have a zero immediate reward associated
with them.

Figure 3: The hPOMDP discussed in the proof of Theorem 3.

The expected value of (C,0) for an observation based policy
$\pi$ depends upon the relative frequency of the transitions
A → C and B → C; this in turn depends upon how often action
1 is selected from observation A for the sake of active
exploration. We make no special assumptions regarding
an active exploration strategy: we only assume the relative
frequencies of action 0 and action 1 selections from observation
A are both non-zero; hence $Q^\pi(C,0) \in (0, 0.5)$.

From the rules of TD updates we can derive that
$Q^\pi(A,1) = \gamma(\lambda \cdot 1 + (1 - \lambda)\, Q^\pi(C,0))$, assuming $\gamma \in [0,1]$.
This interests us, because $Q^\pi(A,1)$ would equal $\gamma$ under a Monte
Carlo method of credit assignment, but for TD($\lambda$) returns
$Q^\pi(A,1) < \gamma$ for all $\lambda < 1$.

Therefore, if the value of the immediate reward for (A,0)
is such that $Q^\pi(A,1) < Q^\pi(A,0) < \gamma$, then observation A
would see action 0 as preferable to action 1, even though
the optimal policy corresponds to selecting action 1. In
such a case, the optimal observation-based policy for this
hPOMDP does not represent a learning equilibrium if
TD($\lambda$) returns are used with $\lambda < 1$. □
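The quantities used in this proof are easy to evaluate for concrete numbers. The following sketch assumes the expression for the TD estimate of (A, 1) stated in the proof and a fixed probability of selecting action 1 from A; the discount factor, trace parameter, exploration probability and reward for (A, 0) are illustrative choices, and the function names are hypothetical.

```python
def q_c0(p_action1):
    """First-visit value of (C, 0) when A selects action 1 with probability p_action1."""
    via_a = 0.5 * p_action1  # start in A, then move to C; terminal reward 1
    via_b = 0.5              # start in B, always move to C; terminal reward 0
    return (via_a * 1.0 + via_b * 0.0) / (via_a + via_b)


def q_a1(gamma, lam, q_c):
    """Estimate for (A, 1) from the proof: gamma * (lam * 1 + (1 - lam) * Q(C, 0))."""
    return gamma * (lam * 1.0 + (1.0 - lam) * q_c)


gamma, lam, p_action1 = 0.9, 0.0, 0.5  # illustrative values only
q_c = q_c0(p_action1)                  # lies strictly between 0 and 0.5
td_estimate = q_a1(gamma, lam, q_c)    # what a 1-step TD learner credits to (A, 1)
mc_estimate = q_a1(gamma, 1.0, q_c)    # lam = 1 recovers the Monte Carlo value gamma

reward_a0 = 0.5                        # hypothetical reward placed between the two
prefers_action0 = td_estimate < reward_a0 < mc_estimate
print(q_c, td_estimate, mc_estimate, prefers_action0)
```

With these numbers the learner values action 0 above action 1 at A, although selecting action 1 yields the larger discounted return from A, which is the gap the proof exploits.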


Firstly, we should point out that the proof of Theorem 3 has
been constructed so that it applies equally to both on-policy
methods, such as SARSA (e.g. Singh & Sutton, 1996), and
to off-policy methods, such as Q-learning.

Further, if we consider the special case of $\lambda = 0$, we can
use this proof to additionally arrive at the following result:

Corollary 1 If a 1-step Q-learning (or 1-step SARSA)
method of credit assignment is used for direct RL of an
NMDP, then it is not guaranteed that there exists an optimal
observation-based policy representing a learning equilibrium.

We note we can also use the proof of Theorem 3 to settle a
conjecture in (Singh et al., 1994) regarding the optimality
of Q-learning for observation-based policies of POMDPs.
The authors of that paper conjectured that Q-learning in
general might not be able to find the best deterministic
memoryless (i.e. observation-based) policy for POMDPs.
If we consider the $\lambda = 0$ case (i.e., the case corresponding to
1-step Q-learning), this result follows directly from the proof,
since the optimal policy for the hPOMDP used in the proof
of Theorem 3 is in fact also deterministic.

We also note that in (Pendrith & McGarity, 1997) is a proof
that extends these results from 1-step to multi-step corrected
truncated returns (CTRs). We omit the proof here
for space reasons.

Taken together, these results show that the key property of
optimal observation-based policies being stable for direct
RL methods does not generalise from Markovian to
non-Markovian domains. The stability of optimal observation-based
policies under standard RL methods can be guaranteed
for hPOMDPs, a general class of NMDPs, only if the
additional restrictions of using undiscounted rewards and
using actual return credit assignment methods are imposed.
These results apply for stochastic as well as for deterministic
optimal observation-based policies.

7 Related work

The POMDP theoretical framework was originally formulated
in the context of a set of operations research (OR)
problems; the wider RL literature reflects an important line
of research that is bringing OR methods to bear on the general
problem of discovering effective policies in partially
observable stochastic domains (Kaelbling et al., 1995). In
contrast to “direct” methods of RL for POMDPs, however,
these methods generally rely on state-estimation techniques
that attempt to disambiguate observations into true Markov
states.

Although an analysis of direct RL for POMDPs as presented
in this paper might prima facie seem to have little
bearing on such approaches, this is not necessarily the
case. We consider that even if we are using active state-estimation
techniques in a POMDP setting, the problem
will remain non-Markov to some degree or another while
the state-estimation is imperfect; and, in general, the problem
of state-disambiguation has been shown to be difficult
(Littman, 1996).

In  (Littman,  1994)  is  a  complexity  analysis  of  the  general 
problem  of  finding  the  optimal  deterministic  memoryless 
(i.e., observation-based) policy for an NMDP. In the general
case, this turns out to be NP-complete. More optimistically,
in the same paper there is evidence presented
that heuristic methods for searching the policy space might
be  expected  to  find  very  good  or  even  optimal  policies  in 
the  average  case. 

In  (Singh  et  al.,  1994)  is  proposed  a  framework  for  the  anal¬ 
ysis  of  direct  RL  for  NMDPs;  it  is  built  around  a  class  of 
POMDPs  conceptually  similar  to  hPOMDPs  in  several  im¬ 
portant  respects. 

The authors analyse what two different 1-step TD RL methods
(TD(0) and 1-step Q-learning) will learn as value functions
for the class of POMDPs under consideration. While
they do not continue to a full analysis of TD($\lambda$) for general
$\lambda < 1$, they do point out that a Monte Carlo method like
TD(1) will result in accurate value estimates for an example
POMDP they analyse.

As  noted  earlier,  they  conjecture  that,  in  general,  1-step  Q- 
learning  is  not  guaranteed  to  learn  even  the  best  determinis¬ 
tic  observation-based  policy  for  a  POMDP.  However,  their 
analyses  are  concentrated  on  the  issue  of  the  accuracy  of 
observation-based  value  estimation,  rather  than  on  the  sta¬ 
bility  of  optimal  policies,  which  has  been  our  primary  fo¬ 
cus.  Also,  as  a  consequence  of  the  multiple-visit  definition 
of  V{s)  in  their  framework,  their  analysis  was  necessarily 
restricted  to  the  asymptotic  behaviour  of  ergodic  systems,  a 
limitation  which  does  not  apply  to  the  framework  presented 
here. 

The  analyses  of  cooperative  learning  automata  in  Markov 
settings  by  Witten  (1977)  and  Wheeler  &  Narendra  (1986) 
provided  the  game  theoretic  perspective  facilitating  the 
original  intuitions  and  reasoning  leading  to  the  results  pre¬ 
sented  in  this  paper. 

8  Conclusions  and  Future  Work 

An analysis of hPOMDPs has proven to be an aid to understanding
the theoretical implications of applying standard
discounted reward RL methods to non-Markov environments.
Extending earlier work, the framework we
present applies to non-ergodic as well as discounted reward
NMDPs, facilitating a more direct understanding of the issues
involved.

Our analysis starts with the simple observation that having
a global maximum in policy space which is also a learning
equilibrium is a necessary condition for convergence
to an optimal policy under a given learning method. We
discover that for an important general class of non-Markov
domains, undiscounted, actual return RL methods have significant
theoretical advantages over discounted returns and
TD methods of credit-assignment.

A  move  from  discounted  to  undiseounted  rewards  natu¬ 
rally  suggests  a  closer  look  at  average  reward  RL  methods 
for  equilibrium  properties  in  non-Markov  environments. 
Some  steps  in  this  direction  have  already  been  made  in 
(Singh  et  al.,  1994)  and  (Jaakkola,  Singh,  &  Jordan,  1995). 
Theorem 2 may point to subtle problems translating “transient
reward” sensitive metrics such as Blackwell optimality
(Mahadevan, 1996) from MDPs to NMDPs. Investigations
are continuing in this direction.

Abstract

Three  factors  are  related  in  analyses  of  per¬ 
formance  curves  such  as  learning  curves:  the 
amount  of  training,  the  learning  algorithm,  and 
performance.  Often  we  want  to  know  whether 
the algorithm affects performance and whether
the  effect  of  training  on  performance  depends  on 
the  algorithm.  Analysis  of  variance  would  be  an 
ideal  technique  but  for  carryover  effects,  which 
violate  the  assumptions  of  parametric  analysis 
of  variance  and  can  produce  dramatic  increases 
in Type I errors. We propose a novel, randomized
version of the two-way analysis of variance
which  avoids  this  problem.  In  experiments  we 
analyze  Type  I  errors  and  the  power  of  our  tech¬ 
nique,  using  common  machine  learning  datasets. 


1  INTRODUCTION 

A common task in machine learning is comparative assessment
of learning methods. Most research on this issue focuses
on performance measures such as classification accuracy
after training, or percentage of games won by a game-playing
program (e.g. Mitchell 1997 ch. 5, Dietterich (in
press), Rasmussen et al. 1996). However, it is sometimes
interesting  to  compare  time  series  of  performance,  such  as 
learning  curves.  For  example,  two  algorithms  might  have 
comparable  asymptotic  performance,  but  we  would  like  to 
test  the  hypothesis  that  one  achieves  this  level  of  perfor¬ 
mance  more  quickly  than  the  other. 

Which  statistical  procedures  are  appropriate  to  identify  dif¬ 
ferences  between  the  performance  of  algorithms  over  time, 
and  particularly  during  training?  One  obvious  approach 
might  be  to  apply  the  aforementioned  methods  repeatedly 


at  different  times,  comparing  the  performance  of  algo¬ 
rithms  at  each  of  several  levels  of  training.  Unfortunately, 
multiple  comparisons  can  lead  to  overestimates  of  the  sig¬ 
nificance  of  results  (see  Section  2)  and  are  inappropriate  for 
comparing  performance  curves. 

A  better  approach  is  to  describe  differences  between  algo¬ 
rithms  during  training  in  terms  of  two  effects: 

Algorithm  Effect:  Does  one  algorithm  generally  achieve 
higher  performance  than  another? 

Interaction  Effect:  Does  the  influence  of  training  on  per¬ 
formance  depend  on  the  algorithm? 

Figures 1a and 1b illustrate prototypical cases for each effect.
In practice, however, some combination of effects
will occur. In Figure 1c, for instance, both curves start out
with similar slopes, but one of them converges to a lower
asymptote. Figure 1d shows a case where both curves start
at the same point and achieve similar asymptotic performances,
but one algorithm learns faster (with respect to
the amount of training) than the other. In this latter case, we
find that both algorithm and interaction effects concentrate
in the early stages of training, and both effects essentially
disappear with increasing amount of training.

This  paper  presents  a  method  for  detecting  Algorithm  and 
Interaction effects in learning curves. The method is not
restricted to learning curves; it applies to any kind of
performance curve. The method tests two hypotheses:

•  The  mean  performances  of  two  or  more  algorithms  are 
the  same  (no  Algorithm  effect). 

•  The  relationship  between  training  and  performance 
does  not  depend  on  Algorithm  (no  Interaction  effect). 

Such effects are typically tested with analysis of variance
(ANOVA). However, the conventional parametric ANOVA is
based on several assumptions, of which one, homogeneity
of covariance, is strongly violated by most time series
data. In particular, conventional ANOVAs on learning
curves can dramatically overestimate the significance of
algorithm effects and underestimate the significance of
interaction effects. Following some statistical preliminaries
in Section 2, we demonstrate how ANOVA gives incorrect
results for learning curves (Section 3) and then introduce
our novel procedure, a randomized version of ANOVA
(Section 4). The remainder of the paper presents experimental
results with conventional and randomized ANOVA, comparing
the power and Type I errors of the methods.

Figure 1: Some kinds of differences between learning curves. The statistical effects on performance (Algorithm and/or
Interaction effects) are listed for each situation. In case c, the Interaction effect disappears at the later stages of training; in
case d, both effects disappear.

2  STATISTICAL  HYPOTHESIS  TESTING 

This  section  defines  terms  and  may  safely  be  skipped  by 
readers  familiar  with  statistical  hypothesis  testing. 

Hypothesis testing involves these steps: Assert a null hypothesis
$H_0$. Decide on a statistic $\phi$. Collect a sample $s$
of size $n$ and calculate $\phi(s)$ for the sample. Derive the
probability distribution $S$ of all possible values of $\phi(i)$ for
samples $i$ of size $n$ under $H_0$. These restrictions are important:
$S$ isn’t the distribution of $\phi$ for any sample, but
for samples of size $n$ that would arise if the null hypothesis
were true. $S$ is called the sampling distribution of $\phi$.
One may then ask, “What is the probability of obtaining a
statistic value of $\phi(s)$ or more by chance if $H_0$ were true?”
The answer, called a p value, is the area of $S$ above $\phi(s)$.
Suppose $p = .01$. Should you reject the null hypothesis?
There isn’t a correct answer to this question, but you can be
assured that if you do reject $H_0$, the probability that you do
so in error is no greater than $p$. Rejecting $H_0$ when it is true
is called a Type I error. Failing to reject $H_0$ when it is false
is a Type II error, and the power of a test (the probability
that you will reject $H_0$ when it is false) is one minus the
probability of a Type II error.


One may also ask, “What value of $\phi(s)$ must I exceed to
be assured that my p value is less than some threshold $\alpha$?”
This is called the critical value of $\phi$ and, obviously, it varies
with $\alpha$.

One should not compare performance curves by repeatedly
comparing points on the curves (e.g., comparing performance
after $i, 2i, 3i, \ldots$ training instances). Each comparison
will with some probability $\alpha$ assert a difference in
performance when in reality there is none, a Type I error.
If the comparison procedure is applied $m$ times, to $m$ pairs
of points on learning curves, then the total probability of
Type I error is roughly $1 - (1 - \alpha)^m$. (The probability is
exactly $1 - (1 - \alpha)^m$ if the comparisons are independent,
but they are not, and their non-independence necessitates
the technique developed in this paper.) One can control the
total probability of a Type I error, but only by reducing $\alpha$,
which increases the critical values for individual comparisons
and makes it less likely that comparisons will find
differences that actually exist. Said differently, the power
of the tests is reduced (see Cohen 1995 for a discussion of
related issues). Multiple comparisons are not the right tool
for comparing performance curves.
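
As a small illustration of the formula above, the following Python sketch evaluates the total Type I error probability for independent comparisons. The function names are hypothetical, and the second function (the Šidák adjustment, a standard textbook correction that this paper does not name) is included only to show how sharply the per-comparison $\alpha$ must shrink.

```python
def familywise_error(alpha, m):
    """Total Type I error probability for m independent comparisons,
    each made at level alpha: 1 - (1 - alpha)^m."""
    return 1.0 - (1.0 - alpha) ** m


def per_comparison_alpha(total_alpha, m):
    """Per-comparison level that keeps the total at total_alpha
    (Sidak adjustment; assumes independent comparisons)."""
    return 1.0 - (1.0 - total_alpha) ** (1.0 / m)


if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        print(m, familywise_error(0.05, m), per_comparison_alpha(0.05, m))
```

Because the comparisons on a learning curve are not independent, these numbers are only the "roughly" of the text, not exact error rates.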

3  ANOVA  FOR  COMPARING 
PERFORMANCE  CURVES 

Suppose we have two learning algorithms $A_1$ and $A_2$, each
of which trains $l$ times on a set of $k$ instances, e.g., in an
$l$-fold cross validation procedure. Then we have $l$ estimates
of the performance of each algorithm at each level of training.
Put another way, we have $l$ “lines” for $A_1$ and another
$l$ lines for $A_2$, where each line is
a list of $k$ numbers that represent the performance of the
algorithm at level $h$ ($1 \le h \le k$) of training, on that particular
fold of the cross validation. A schematic data table
is shown in Figure 2, where the axes of the table represent
the factors Training and Algorithm. Lines may of course
be  generated  by  methods  other  than  cross-validation;  for 
example,  they  might  represent  training  on  several  differ¬ 
ent  datasets.  The  important  thing  is  that  the  data  points  on 
a  line  are  not  independent.  In  statistical  parlance,  they  are 
repeated  measures  and  they  create  carryover  effects,  mean¬ 
ing  that  the  performance  represented  by  earlier  points  on  a 
line influences, or carries over to, later performance.

Figure 2: Data table setup for randomized ANOVA. This
example shows $l = 4$ learning curves per algorithm.

Were  it  not  for  these  carryover  effects,  analysis  of  variance 
would  be  an  ideal  tool  to  analyze  learning  curves.  Analysis 
of  variance  tests  for  main  effects  of  factors  and  interaction 
effects  between  factors.  Each  kind  of  effect  is  represented 
by an F statistic, which has an expected value of 1.0 under
the  null  hypothesis  of  no  effect.  Formulae  for  calculating 
F  are  straightforward  and  widely  available  (e.g.,  see  Cohen 
1995)  and  will  not  be  repeated  here.  The  patterns  of  data  in 
Figure  1  can  be  discriminated  by  F  statistics  for  main  and 
interaction  effects. 
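
For reference, one standard textbook form of these statistics is sketched below. It assumes a balanced, fixed-effects two-way layout with $m$ algorithms ($m = 2$ in the running example), $k$ training levels, and $l$ lines per algorithm; whether this is exactly the variant used in the paper is an assumption. Here $y_{ihj}$ denotes the performance of algorithm $i$ at training level $h$ on line $j$, and a dot in a subscript denotes averaging over that index.

```latex
\begin{align*}
SS_{alg} &= k\,l \sum_{i=1}^{m} \left(\bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot\cdot\cdot}\right)^2,
  & df_{alg} &= m - 1, \\
SS_{int} &= l \sum_{i=1}^{m} \sum_{h=1}^{k}
  \left(\bar{y}_{ih\cdot} - \bar{y}_{i\cdot\cdot} - \bar{y}_{\cdot h\cdot} + \bar{y}_{\cdot\cdot\cdot}\right)^2,
  & df_{int} &= (m - 1)(k - 1), \\
SS_{within} &= \sum_{i=1}^{m} \sum_{h=1}^{k} \sum_{j=1}^{l}
  \left(y_{ihj} - \bar{y}_{ih\cdot}\right)^2,
  & df_{within} &= m\,k\,(l - 1), \\
F_{alg} &= \frac{SS_{alg} / df_{alg}}{SS_{within} / df_{within}},
  & F_{int} &= \frac{SS_{int} / df_{int}}{SS_{within} / df_{within}}.
\end{align*}
```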

Carryover effects make it difficult to specify the sampling
distributions of F statistics. Classical F distributions are
derived under some assumptions, and while F tests are robust
against departures from most of these, learning curves
violate an important one: homogeneity of covariance. To
see what this means, note that we could calculate a correlation
between the four data points in the $A_1, t_1$ cell of Figure
2 and the four in the $A_1, t_2$ cell. Under homogeneity of
covariance, this correlation would be constant for any pair of
cells $A_k, t_i$ and $A_k, t_j$. However, the correlation between
performance after $t$ and $t + 1$ training instances is apt to be
higher than the correlation between performance after $t$ and
$t + 100$ instances, so homogeneity of covariance is apt to
be violated. The consequence is that the Type I error probabilities
no longer correspond to the given $\alpha$ level (Cohen
1995 (p. 306), Keppel 1973, O’Brien and Kaiser 1985).
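
The violation can be made concrete with a small simulation. The sketch below does not use the paper's data: it generates hypothetical learning curves as a shared rising trend plus first-order autoregressive noise (an assumed noise model, chosen only because it produces carryover), and then compares the correlation between adjacent training levels with the correlation between levels 100 steps apart. All names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_curves, n_levels, rho = 500, 120, 0.9

# Hypothetical learning curves: a common trend plus AR(1) noise, so that
# the deviation at one training level carries over to the next level.
trend = 1.0 - np.exp(-np.arange(n_levels) / 30.0)
noise = np.empty((n_curves, n_levels))
noise[:, 0] = rng.standard_normal(n_curves)
for t in range(1, n_levels):
    innovation = rng.standard_normal(n_curves)
    noise[:, t] = rho * noise[:, t - 1] + np.sqrt(1 - rho**2) * innovation
curves = trend + 0.05 * noise  # shape (n_curves, n_levels)


def level_corr(data, t1, t2):
    """Correlation across curves between training levels t1 and t2."""
    return np.corrcoef(data[:, t1], data[:, t2])[0, 1]


near = level_corr(curves, 10, 11)   # adjacent levels
far = level_corr(curves, 10, 110)   # levels 100 steps apart
print(f"corr(t, t+1)   = {near:.3f}")
print(f"corr(t, t+100) = {far:.3f}")
```

Under homogeneity of covariance the two printed correlations would be about equal; with carryover of this kind the adjacent-level correlation is much larger.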

So F statistics can represent the effects in Figure 1 nicely,
but  carryover  effects  bias  the  p  values  of  the  statistics.  Can 
we  salvage  ANOVA  and  F  tests?  One  common  tactic  is  to 
correct  statistics  to  compensate  for  biases.  The  following 
experiment  (and  those  in  Sec.  5)  shows  that  this  tactic  will 
not  work.  We  generated  learning  curves  from  three  dif¬ 
ferent  datasets  (Chess,  RL,  and  Tic-Tac-Toe;  see  the  Ap¬ 
pendix).  The  results  (Figure  3)  demonstrate  a  dramatic  in¬ 
crease  in  Type  I  error  in  the  case  of  Algorithm  effects,  and 
a  decrease  for  Interaction  effects.  The  histograms  demon¬ 
strate  that  the  frequencies  of  these  errors  depend  on  the 
dataset,  which  implies  that  one  cannot  correct  the  F  statis¬ 
tics  with  a  simple  adjustment.  In  particular,  the  Chess  and 
Tic-Tac-Toe learning curves were generated according to the
same  procedure,  their  degrees  of  freedom  are  identical,  and 
yet  their  mean  rejection  rates  differ  dramatically. 

Another way to salvage ANOVA is to somehow find the appropriate
sampling distributions for F statistics when homogeneity
of covariance is violated. This would allow us
to control Type I errors precisely. Our method, discussed
in Section 4, yields these sampling distributions, and accurate
p values, whether or not homogeneity of covariance
is violated. The procedure is based on randomization (see,
e.g., Cohen 1995, ch. 5). Consider first the null hypothesis
that Algorithm has no effect on performance. If it were
true, then the lines associated with algorithm $A_1$ in Figure 2
might equally well be associated with $A_2$, or with any other
algorithm. Thus, if we randomly redistribute lines among
algorithms, and then calculate $F_{alg}$ in the usual way, we
will derive one value of $F_{alg}$ under the null hypothesis that
Algorithm is independent of performance. For clarity, denote
this statistic $F^*_{alg}$ to remind us that it was derived by
randomization, that is, shuffling lines, and to distinguish
it from the sample statistic $F_{alg}$ that was calculated from
the original (unshuffled) data table. If we shuffle the lines
again, we will get another, somewhat different value of
$F^*_{alg}$, and if we shuffle 1000 times we can get a distribution
of 1000 values of this statistic.

By shuffling lines instead of, say, individual data points
among algorithms, we preserve the dependencies among
the data points on each line. Said differently, we treat a line
as a unit for the purpose of estimating the distribution of
$F^*_{alg}$, so the degree of dependence among the data on a line
is irrelevant. As mentioned above, when homogeneity of
covariance is violated, comparing $F_{alg}$ to a conventional F
distribution will underestimate $p$, that is, it will make $F_{alg}$
look significant at a given level of $\alpha$ when it is not. The
distribution of $F^*_{alg}$ protects against this error, as illustrated
by Figure 4.

$F^*_{alg}$ is not technically a sampling distribution but it serves
the same purpose, namely, to estimate a p value for a sample
result, or to find a critical value that $F_{alg}$ must exceed
to reject $H_0$ with some level $\alpha$ of confidence (Cohen 1995,
p. 175).

Initialize $c = 0$. Then do 1000 times:

1. Generate a set $L$ of learning curves using C4.5.

2. Partition $L$ randomly into $L_1$ and $L_2$ representing two different imaginary algorithms,
with $|L_1| = |L_2| = |L|/2$.

3. Perform conventional ANOVA on these data, obtaining the probability $p$ that it is incorrect
to reject the null hypothesis that there is no effect of Algorithm on performance.

4. If $p < 0.05$ then increment $c$.

[Figure 3 panels: Chess, RL, Tic-Tac-Toe]

Figure 3: Illustration of the increase in Type I error resulting from carryover effects. For each dataset, the procedure
given above was executed 100 times and the resulting $c$ values averaged. Without carryover effects, one would expect
$c = 1000\alpha = 50$. The histograms of $c$ values show that $H_0$ was rejected much more frequently, which demonstrates the
inappropriateness of the conventional ANOVA for comparison of learning curves. See the Appendix for details about the
datasets used.

4  THE  PROCEDURE  IN  DETAIL 

Consider a set $A$ of $m$ learning algorithms $A_1, \ldots, A_m$.
For each algorithm $A_i$ we have a set of $l$ learning
curves. Each learning curve constitutes a $k$-tuple of real
numbers, where the $h$th entry
gives the performance score of the learning algorithm $A_i$ on
the $j$th run after $A_i$ has performed an amount $t_h$ of
training.¹ Note that $k$ and the $t_h$ ($1 \le h \le k$) are the same for all
algorithms, but $l$, the number of learning curves generated
by an algorithm, need not be the same for all algorithms.

We will test two null hypotheses: There is no effect of
Algorithm on performance, and there is no effect of Algorithm
on the relationship between Training and performance.
These correspond to F tests of a main effect and
the interaction effect in a two-way analysis of variance, so
we will compute the appropriate statistics, $F_{alg}$ and $F_{int}$,
but we will compare them to the randomized sampling
distributions of $F^*_{alg}$ and $F^*_{int}$.

The  complete  procedure  can  be  summarized  as  follows: 

1. For each algorithm $i$, collect $l$ learning curves. If there
are $m$ algorithms, this will produce a data table like the
one in Figure 2.

¹ The “amount of training” is an abstract notion here which
could be given by the number of training instances processed, the
number of trials run, or even by the training time.

Figure 4: Histograms generated by the same procedure as Figure 3, but p-values were compared against randomized F
distributions (500 shuffles) instead of the parametric distributions. In fact, the mean rejection rates of around 50 correspond
to the target significance level of $\alpha = 0.05$. This is also true for the corresponding histograms for the Interaction effect
(not shown).

2. Run a conventional two-way analysis of variance on
this data table to obtain sample statistics $F_{alg}$ and $F_{int}$.

3. Generate the sampling distributions $F^*_{alg}$ and $F^*_{int}$.
Throw the $m \times l$ learning curves into a “pool” V.
Repeat for $i = 1, \ldots, z$ (where $z$ is large, e.g., 1000):

(a) Shuffle V and reassign each of the $ml$ learning
curves to the $m$ algorithm categories
(rows in the data table) such that each row
contains $l$ curves. Shuffling V enforces the
null hypothesis of no association between
performance and algorithm.

(b) Run a conventional two-way analysis of variance
on the resulting data table and record
$F^*_{alg,i}$ and $F^*_{int,i}$.

4. Find the critical values in the distributions $F^*_{alg}$ and
$F^*_{int}$. If $\alpha = .05$ and $z = 1000$ then the critical value
in each sorted distribution is the 950th, because 5% of
the distribution lies above this value. In general, the
critical value is the $100(1 - \alpha)$th percentile.

5. If $F_{alg}$ exceeds the critical value for the $F^*_{alg}$ distribution,
reject the null hypothesis that Algorithm does
not affect performance. Similarly if $F_{int}$ exceeds the
critical value for the $F^*_{int}$ distribution, reject the null
hypothesis of no interaction effect.

6. The p value for each hypothesis is derived from the
rank of the closest value in the sorted sampling distribution.
For example, if $F_{alg} = 10.3$ and the closest
value in $F^*_{alg}$ is 10.2, and if the rank of this value is 972
out of 1000, then $p < (1000 - 972)/1000 = .028$.
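
The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `two_way_f` and `randomized_anova` are hypothetical, the design is assumed balanced (the same number $l$ of curves for every algorithm), the F statistics use the usual fixed-effects two-way formulas with the within-cell mean square as denominator, and the p value is taken as the fraction of shuffled statistics at least as large as the observed one, a common variant of the rank rule in step 6.

```python
import numpy as np


def two_way_f(curves):
    """Return (F_alg, F_int) for a balanced two-way fixed-effects ANOVA.

    curves has shape (m, l, k): m algorithms, l curves per algorithm,
    k training levels per curve.
    """
    m, l, k = curves.shape
    grand = curves.mean()
    alg_means = curves.mean(axis=(1, 2))     # one mean per algorithm
    train_means = curves.mean(axis=(0, 1))   # one mean per training level
    cell_means = curves.mean(axis=1)         # shape (m, k)

    ss_alg = k * l * np.sum((alg_means - grand) ** 2)
    resid = cell_means - alg_means[:, None] - train_means[None, :] + grand
    ss_int = l * np.sum(resid ** 2)
    ss_within = np.sum((curves - cell_means[:, None, :]) ** 2)

    ms_within = ss_within / (m * k * (l - 1))
    f_alg = (ss_alg / (m - 1)) / ms_within
    f_int = (ss_int / ((m - 1) * (k - 1))) / ms_within
    return f_alg, f_int


def randomized_anova(curves, z=1000, alpha=0.05, seed=0):
    """Randomized two-way ANOVA by shuffling whole curves (steps 2 to 6)."""
    curves = np.asarray(curves, dtype=float)
    m, l, k = curves.shape
    rng = np.random.default_rng(seed)

    f_alg, f_int = two_way_f(curves)         # step 2: sample statistics

    pool = curves.reshape(m * l, k)          # step 3: pool of whole curves
    null = np.empty((z, 2))
    for i in range(z):
        order = rng.permutation(m * l)       # (a) shuffle and reassign
        null[i] = two_way_f(pool[order].reshape(m, l, k))  # (b) record F*

    # Step 4: critical values at the 100(1 - alpha)th percentile.
    crit_alg, crit_int = np.quantile(null, 1 - alpha, axis=0)
    return {
        "F_alg": f_alg, "F_int": f_int,
        "crit_alg": crit_alg, "crit_int": crit_int,
        # Step 6 variant: share of shuffled statistics >= the observed one.
        "p_alg": float(np.mean(null[:, 0] >= f_alg)),
        "p_int": float(np.mean(null[:, 1] >= f_int)),
    }
```

Rows of `pool` are complete curves, so each shuffle moves a curve between algorithms as a unit and leaves the dependencies among its points intact. A call such as `randomized_anova(data, z=1000)` on an array of shape `(m, l, k)` returns both the sample statistics and their randomized critical values, so step 5 reduces to comparing `F_alg` with `crit_alg` and `F_int` with `crit_int`.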


5  EXPERIMENTAL  RESULTS 

In  Section  3  we  illustrated  the  increase  in  Type  I  error 
caused  by  comparing  F  statistics  to  standard  F  distribu¬ 
tions.  This  section  provides  a  more  detailed  account  of  this 
phenomenon. Both Algorithm and Interaction effects are
analyzed  on  the  Chess  dataset  (see  Appendix).  The  fol¬ 
lowing  section  discusses  the  probability  of  Type  I  error, 
and  Section  5.2  compares  the  power  of  the  conventional 
and  randomized  ANOVAs.  In  all  cases  we  use  m  =  2  sets 
of learning curves. Note that our method applies to any
$m \ge 2$.

5.1  TYPE  I  ERROR  MEASUREMENTS 

As shown in Section 3, the standard F distributions tend to
overestimate the significance of Algorithm effects, but
underestimate the Interaction effects. We expected the
overestimations based on previously published results (e.g.,
Keppel 1973, p. 464) but the underestimations were a surprise
and we do not have a satisfactory explanation for this
phenomenon. In one sense, we do not care why the standard
F distributions detect Interaction effects less often than
expected, because we have a method to construct correct F
distributions. Yet we were curious. To shed some light on
this issue, we examined the frequency of Type I errors for
Interaction and Algorithm effects, for conventional ANOVA
and our method, in a variety of conditions.

Recall that Type I error rates are the frequencies with which
the null hypothesis is rejected when it is true, i.e., when
there is no effect. In Section 3 we enforced the null hypothesis
by splitting a set of learning curves generated by
one algorithm into two groups, calling one group “algorithm
A,” the other “algorithm B,” then testing for an Algorithm
effect and an Interaction effect. Because the two
groups were generated by one algorithm, we expected neither
effect; that is, we expected Type I error rates of $\alpha$. In
the following experiments we enforce the null hypothesis
in a slightly different way. First we generated a set $L$ of
learning curves with C4.5, then to each curve we applied
a transformation, yielding another set $L'$. The transformation
induced an Algorithm effect or an Interaction effect or
both. In other words, the mean curves for $L$ and $L'$ correspond
to the pairs of curves in Figure 1. Then, to enforce
the null hypothesis, we shuffled the curves in $L$ and $L'$.
Whereas the earlier procedure enforced the null hypothesis
by randomly dividing a set of statistically-identical learning
curves, this procedure is more natural in starting with
two sets of curves ($L$ and $L'$) that are different, then shuffling
them. Moreover, we have tight control over the degree
of difference between $L$ and $L'$ because we transform the
former to get the latter.

We now describe this procedure in detail. The following
steps compute the number $c$ of rejections of $H_0$ during
1000 analyses of variance, starting from a set $L$ of learning
curves:

Initialize $c_{conv} = c_{rand} = 0$. Then do 1000 times:

1. Construct $L'$ by modifying each curve from $L$ according
to one of the cases given in Figure 1. The degree
of modification is controlled by a factor $f$. We will
denote this operation by $L' = M_a(L, f)$ for case a in
Figure 1, and likewise for cases b, c, d.

2. Partition $L \cup L'$ randomly into $L_1$ and $L_2$, with
$|L_1| = |L_2| = 20$.

3. Perform conventional ANOVA on these data to obtain
the F statistic for the tested effect.

4. Compare F to the appropriate conventional F distribution
and read off the probability $p_{conv}$ that it is
incorrect to reject $H_0$.

5. Generate a randomized sampling distribution $F^*$ using
400 shuffles as described in Section 4 item 3, and
read off $p_{rand}$.

6. If $p_{conv} < \alpha$ then increment $c_{conv}$.
If $p_{rand} < \alpha$ then increment $c_{rand}$.

This procedure was performed with respect to Algorithm
and Interaction effects, and for 10 different values of $f$.
For each of these cases, the $c$ values resulting from 10 such
runs were averaged to yield a data point shown in Figure 5.
The effect of the modification factor $f$ on the shape of a
curve is also illustrated in the figure. Details on the four
modification procedures are given in the Appendix.

As expected, the randomized ANOVA always achieves
Type I error probabilities near the target significance level
of $\alpha = 0.05$. The conventional method, however, tends to
assert an Algorithm effect too often (increase in Type I error
probability). In contrast, Interaction effects are mostly
detected less often than the expected 5%.

Modification $M_b$ is a dramatic case: This modification did
not introduce an Algorithm effect, and yet such an effect
was often detected by the conventional ANOVA at a frequency
inversely proportional to the modification factor
$f$. The modification introduced an Interaction effect which
was then shuffled away, enforcing the null hypothesis of no
interaction, yet the frequency with which conventional ANOVA
detected Interaction effects increases with $f$. We do
not know why, and these experiments fail to explain why
Type I errors for interaction effects are lower than expected,
although the dependence on $f$ is intriguing.

The magnitude of these misjudgments can be quite dramatic
(up to a factor of ten in these examples), but depends
on the type of the effect and the modification factor $f$.
Because of these dependencies, we think it is not possible to
correct the standard F statistics to control Type I errors
precisely. No matter: Our randomized ANOVA produces the
expected Type I errors.

5.2  POWER  MEASUREMENTS 

Whereas Type I errors involve detecting effects that don’t
exist, Type II errors involve failing to detect effects that do
exist. The power of a test is one minus the Type II error
rate, that is, the probability of detecting a true effect. To
measure the power of both conventional and randomized
versions of ANOVA, we employed the same modification
strategy as in the previous section. Here, however, $L$ and
$L'$ are not shuffled. In other words, $L$ and $L'$ give us controlled
Algorithm and Interaction effects. The following
procedure measures the power of both ANOVAs to detect
these effects:

1. Construct $L_2 = M_x(L_1, f)$, where $x$ is one of a, b, c, d.

2. Generate a randomized sampling distribution $F^*$, as
described in Section 4 item 3, using 500 shuffles of
$2 \times 10$ learning curves each.

3. $c_{conv} = c_{rand} = 0$.

4. Do 100 times:

(a) Randomly draw a set $L'_1$ of 10 unique curves
from $L_1$.
Randomly draw a set $L'_2$ of 10 unique curves
from $L_2$.

(b) Perform conventional ANOVA and obtain F.

(c) Compare F to the parametric F distribution and
obtain $p_{conv}$.
Compare F to the randomized $F^*$ distribution
and obtain $p_{rand}$.

(d) If $p_{conv} < \alpha$ then increment $c_{conv}$.
If $p_{rand} < \alpha$ then increment $c_{rand}$.

5. Divide $c_{conv}$ and $c_{rand}$ by 100 to obtain the power
measurements.

[Figure 5 columns: Curve Illustrations, Algorithm Effect, Interaction Effect]

Figure 5: Effects asserted by the conventional and randomized ANOVA methods. Each row shows one of the modification
cases a-d from Figure 1. The left column illustrates the effect of the modification for different values of $f$ ($f = 0$ means no
modification). The center and right columns plot the number of times (of 1000) the conventional and randomized analyses
asserted an Algorithm or Interaction effect at $\alpha = 0.05$.


This  procedure  was  performed  to  introduce  Algorithm  and 
Interaction  effects  for  10  different  values  of  f.  For  each  of 
these  cases,  the  c  values  resulting  from  8  such  runs  were 
averaged  to  yield  a  data  point  shown  in  Figure  6. 

As  in  earlier  experiments,  the  conventional  ANOVA  usually 
overestimates the presence of an Algorithm effect, thus it
appears  more  powerful  than  our  randomized  ANOVA.  But 
this  “power”  is  illusory,  like  a  watchdog  that  barks  all  night 
whether  or  not  a  prowler  is  on  the  premises.  Sure,  the  dog 
will bark when there is a prowler (the probability of detecting
a prowler is 1.0), but it is a useless animal. In modifications
a, c and d, where Algorithm effects are present,
our  method  detects  them  handily  and  at  a  Type  I  error  rate 
of  approximately  5%.  In  case  b,  where  there  is  no  algorithm 
effect,  our  method  does  not  report  one,  but  the  conventional 
method  does.  Similarly,  for  interaction  effects,  our  method 
does  not  detect  one  in  case  a,  because  none  exists,  and  it  is 
quite  powerful  in  the  other  cases,  where  interaction  effects 
are  present. 


6  CONCLUSION 

We  have  presented  a  statistical  method  for  comparing  sets 
of  performance  curves,  such  as  learning  curves,  when 
points  on  the  curves  are  not  independent,  that  is,  when  there 
are  carryover  effects  and  homogeneity  of  covariance  is  vi¬ 
olated.  We  demonstrated  that  in  these  conditions  conven¬ 
tional  analysis  of  variance  produces  a  sometimes  dramatic 
surplus  of  Type  I  errors  for  main  (algorithm)  effects  and  a 
shortfall  of  Type  I  errors  for  interaction  effects.  Because 
the  magnitude  of  these  surpluses  and  shortfalls  depends  on 
the  original  dataset,  among  other  things,  we  do  not  think 
they  can  be  corrected  by  adjusting  conventional  F  statis¬ 
tics.  Instead  we  show  how  to  construct  sampling  distribu¬ 
tions  for  the  F  statistics  that  correct  for  violations  of  ho¬ 
mogeneity  of  covariance.  With  this  method,  one  can  con¬ 
trol  error  rates  precisely.  We  recommend  the  method  for 
its  simplicity  and  hope  it  will  be  a  helpful  addition  to  the 
statistical  toolbox  of  the  machine  learning  community. 


Figure 6: Power measurements of the conventional and
randomized ANOVA methods. Each row shows one of the
modification cases a-d from Figure 1. The horizontal axes
indicate the degree $f$ to which one of the two underlying
sets of curves was modified with respect to the other (see
Figure 5).

Appendix: Sources of Learning Curves

Chess:  Chess  Endgame  Database  (king-rook-vs-king, 
Bain  1994)  provided  by  the  UCI  Machine  Learning 
Repository  (Merz  and  Murphy  1996).  Twenty  Learn¬ 
ing  curves  were  generated  by  running  the  decision  tree 
algorithm  C4.5  (Quinlan  1993)  in  a  20-fold  cross  val¬ 
idation  procedure. 

We now describe the modification functions $M_x(L, f)$
used in Section 5. In the following, $r$ refers to the difference
between the performance values of the last and
first points of a given learning curve, i.e. $r = L_k - L_1$.
For each learning curve $L$, each performance value $L_i$
is altered according to a given modification case (cf.
Figure 1):

(a)  Li 

(b)  Li 

(c)  Li 

(d)  Li 

RL:  These  data  were  generated  by  an  AI  program  that  em¬ 
ployed  TD(0)  Reinforcement  Learning  (Sutton  1988) 
to  learn  to  play  Tic-Tac-Toe  against  a  random  oppo¬ 
nent.  The  performance  score  was  the  cumulative  score 
of  one  hundred  test  games  against  a  random  player, 
where losses, draws and wins scored -1, 0, and 1
respectively. Ten learning curves were generated by one
training  session  each. 

Tic-Tac-Toe:  Tic-Tac-Toe  Endgame  Database  (Aha  1991) 
provided  by  the  UCI  Machine  Learning  Repository. 
Learning  curves  were  generated  as  with  the  Chess 
dataset. 
Abstract

The classification algorithm CLEF combines a
version of a linear machine known as a Φ-machine
with a non-linear function approximator that
constructs its own features. The algorithm finds
non-linear decision boundaries by constructing
features that are needed to learn the necessary
discriminant functions. The CLEF algorithm is
proven to separate all consistently labelled
training instances, even when they are not
linearly separable in the input variables. The
algorithm is illustrated on a variety of tasks,
showing an improvement over C4.5, a
state-of-the-art decision tree learning algorithm.


1  Introduction 

The task of classification is to find an approximate
definition for an unknown function
$f : X \to \{c_1, \ldots, c_R\}$, $R \geq 2$, based on a set
of training examples of the form $(x_j, f(x_j))$.
The components of an instance vector $x_j$ can take values
from discrete or continuous domains. It is also possible
that the values of one or more components are missing or
imprecisely recorded for certain training instances, or that
an instance is mislabeled.

This  paper  presents  a  different  approach  to  classification, 
centered  around  the  idea  of  constructing  a  machine  that  is 
linear  in  its  parameters,  but  non-linear  in  the  input  vari¬ 
ables.  Therefore,  the  algorithm  constructs  a  non-linear  fit 
of  the  data.  Unlike  decision  tree  induction,  the  method  does 
not  partition  the  data  into  subproblems.  The  whole  training 
set  is  used  at  all  the  stages  of  the  classifier’s  construction. 
The algorithm does not need multiple runs to achieve good
results, and finds a perfect separation of the training
instances into classes, if one exists. The features it extracts
from the data have a logical form, and thus are easy to
interpret.

2  Linear  Machines 

One approach that constructs a classifier using all the
training data is to use linear machines (Nilsson, 1965; Duda &
Hart, 1973). A linear machine is a set of $R$ linear
discriminant functions $g_i$ used collectively to assign an
instance to one of $R$ classes. Let $x = (1, x_1, \ldots, x_n)$
be an instance description. Each discriminant function
$g_i(x)$ has the form $w_i^T x$, where $w_i$ is an
$(n+1)$-dimensional vector of coefficients (weights). An
instance is assigned class $i$ if and only if
$g_i(x) > g_j(x)\ \forall j \neq i$. If a tie occurs, the
instance is attributed randomly to one of the classes.

The training algorithm of a linear machine adjusts its
weights based on a set of training instances. The machine
starts with arbitrary initial weights, and sweeps through the
set of training instances repeatedly. If an instance having
class $i$ is erroneously placed into class $j$, the weight
vectors corresponding to the two classes are adjusted as
follows: $w_i \leftarrow w_i + cx$ and $w_j \leftarrow w_j - cx$.
The amount of correction $c$ can be computed using the
fractional error correction rule:


where $\alpha \in (0, 1)$ is the step size, controlling the
magnitude of the correction, and $\epsilon > 0$ controls the
“safety margin” between the two classes. If the training
instances are linearly separable, this update rule guarantees
that the linear machine will converge to a boundary that
classifies them correctly.
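
The training loop described above can be sketched in Python. The
formula for $c$ is missing from this copy of the paper, so the sketch
assumes the usual form of the rule,
$c = \alpha\,((w_j - w_i)^T x + \epsilon) / (2\,x^T x)$, which for
$\alpha = 1$ moves $g_i(x) - g_j(x)$ exactly to $\epsilon$. The names
`train_linear_machine` and `predict`, the default parameter values and
the sweep limit are illustrative assumptions, not the authors'
implementation.

```python
def predict(weights, x):
    """Return the index of the discriminant with the largest output."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=lambda k: scores[k])


def train_linear_machine(instances, labels, num_classes,
                         alpha=0.5, eps=0.1, max_sweeps=1000):
    """Train R linear discriminants g_i(x) = w_i . x by error correction.

    Every instance must already carry the leading constant 1.
    The form of the correction c is an assumption (see the lead-in).
    """
    dim = len(instances[0])
    weights = [[0.0] * dim for _ in range(num_classes)]

    for _ in range(max_sweeps):
        errors = 0
        for x, i in zip(instances, labels):
            scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
            # Strongest rival of the true class i.
            j = max((k for k in range(num_classes) if k != i),
                    key=lambda k: scores[k])
            if scores[i] > scores[j]:
                continue  # correctly classified, no update
            errors += 1
            c = alpha * (scores[j] - scores[i] + eps) / (2.0 * sum(v * v for v in x))
            weights[i] = [w + c * v for w, v in zip(weights[i], x)]
            weights[j] = [w - c * v for w, v in zip(weights[j], x)]
        if errors == 0:
            break  # a full sweep without mistakes: training set separated
    return weights
```

Ties are treated as errors here rather than broken at random, which
keeps the sketch deterministic.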

For many tasks, linear combinations of the input values
are not enough to discriminate the groups of instances
belonging to each class. When a non-linear discriminant is
needed, one possible solution is to use a Φ-machine
(Nilsson, 1965), which is much like a linear machine, except
that it uses discriminant functions of the form
$g_i(x) = w_i^T F_i(x)$, where $F_i = (f_1, \ldots, f_m)$ is a
vector of linearly independent, real, single-valued functions
$f_j : X \to \mathbb{R}$, independent of the weights. This
means that the $f_j$ do not vary with the weight adjustments.
Multilayered neural networks, for instance, do not satisfy
this requirement, since their hidden units change with the
weight adjustments.

Φ-machines preserve the theoretical advantages of linear
machines, while allowing for non-linear combinations of
the inputs. Therefore, Φ-machines can represent partitions
of the input space that cannot be represented by linear
machines. The training procedures used for linear machines
can be applied to adjust the weights of Φ-machines. All
the convergence theorems for linear machines apply to
Φ-machines as well.

Due to the great variety of classification tasks, one cannot
know a priori what mappings $f_j$ would be useful as
components of discriminants. It would be useful to construct such
functions $f_j$ automatically, based on the training instances.

3 Constructing a Φ-machine for classification

Any method for automatically constructing Φ-machines
needs to generate functions $f_j$ that are linearly
independent and do not vary when the parameters $w_j$ of the
machine are adjusted. Constructive methods that adjust the
function while correcting the output weights (by adjusting
input weights, for instance) are not suitable candidates,
because they generate functions $f_j$ that are not independent
of the machine parameters $w_j$.

In the case of Boolean input variables, one alternative
would be to choose $f_j$ from a set of basis functions, such
as Rademacher-Walsh or Bahadur-Lazarsfeld polynomials
(Duda and Hart, 1973). However, if the $f_j$ are orthogonal
(i.e. $f_i \cdot f_j = 0\ \forall i \neq j$ and
$f_i \cdot f_i \neq 0$), the information that can be gathered
during training can only say whether more terms are needed,
but not what those terms should be. The search for a good set
of discriminant functions is therefore quite difficult.

An automatic method for constructing a Φ-machine
adequate for the task at hand is needed. To this end, we use the
ELF function approximation algorithm (Utgoff & Precup,
1998), which constructs new features as needed, by
identifying subsets of instances that share intrinsic properties.
One could substitute ELF with any other algorithm that can
automatically construct linearly independent features.

ELF assumes that the instances are represented using
Boolean input variables. Its goal is to find set covers over
the instance space, grouping those instances into subsets
that share an intrinsic property, i.e. that can be associated
with a common value. Let $X$ be the space of all describable
input instances. An ELF feature is a membership function
for a subset of instances $X_j \subset X$:

$$f_j(x) = \begin{cases} 1 & \text{if } x \in X_j \\ 0 & \text{otherwise} \end{cases}$$

When a feature $f_j$ is multiplied by its single corresponding
weight, each term $w_j f_j$ has value $w_j$ for the instances that
$X_j$ covers, and 0 elsewhere, thus associating a particular
value with a particular set of instances.

The subset $X_j$ is represented by a pattern vector with as
many components as the dimensionality of an instance
vector $x$. Each component of a pattern has either the value ‘#’
or the value ‘0’. A ‘#’ matches either of the possible values
of the corresponding input vector, while a ‘0’ in the pattern
matches only a ‘0’ value. For example, the pattern ‘#0’
covers the instances ‘10’ and ‘00’ and does not cover either
‘01’ or ‘11’. The pattern of all ‘#’ covers every domain
element because the pattern matches any domain element
at every component. One pattern is more general than
another if and only if it covers all the instances covered by the
other, and some additional instances as well.
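
The matching and generality relations just defined can be illustrated
with a short Python sketch. Patterns and instances are written as
strings over ‘#’/‘0’ and ‘0’/‘1’ respectively; the function names
`covers` and `more_general` are hypothetical and chosen only for this
illustration.

```python
def covers(pattern, instance):
    """True if every '0' in the pattern meets a '0' in the instance."""
    return all(p == '#' or bit == '0' for p, bit in zip(pattern, instance))


def more_general(p, q):
    """True if p covers everything q covers, plus at least one more instance.

    This holds exactly when the '0' positions of p are a strict subset
    of the '0' positions of q.
    """
    return p != q and all(a == '#' or b == '0' for a, b in zip(p, q))


# The example from the text: '#0' covers '10' and '00' but not '01' or '11'.
covered = [x for x in ('00', '01', '10', '11') if covers('#0', x)]
```

Here `covered` evaluates to `['00', '10']`, in agreement with the text.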

Initially,  each  discriminant  function  consists  of  one  feature, 
which  covers  the  whole  instance  space,  and  has  a  weight 
of  0.  To  evaluate  an  instance  using  a  discriminant  func¬ 
tion,  one  computes  the  linear  combination  of  the  feature 
values  and  feature  weights.  To  update  the  approximation, 
the  training  procedure  revisits  the  training  instances  and  ad¬ 
justs  the  weights  of  the  discriminant  functions  using  the 
fractional  error  correction  rule  (Nilsson,  1965).  Only  fea¬ 
tures  that  matched  the  instance  have  their  weights  adjusted, 
because  features  that  did  not  match  have  value  0. 

For each feature, the algorithm keeps track of the errors
associated with each input bit, in order to determine which
feature is having the greatest difficulty in fitting. When an
adjustment of the weights has ceased to be productive, the
algorithm adds a new feature, which is a specialization of
the feature that has been producing the largest errors.
Specialization is performed by copying the feature and
changing a ‘#’ in its pattern to a ‘0’. The choice of the bit to
specialize is based on the variance of the input errors for
each feature. The bit whose errors are most different from
the mean bit error of the feature is specialized. The new
feature will cover half of the set covered by its “parent”.
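
A minimal sketch of the specialization step follows. The text does
not say how the per-bit errors are accumulated or whether the mean is
taken over all bits or only over the ‘#’ positions, so both the
`bit_errors` argument and the choice of a mean over all bits are
assumptions of this sketch, as is the name `specialize`.

```python
def specialize(pattern, bit_errors):
    """Return a child pattern with one '#' changed to '0'.

    bit_errors[k] is the error attributed to input bit k of this feature.
    The '#' position whose error deviates most from the mean bit error
    is the one that gets specialized.
    """
    candidates = [k for k, p in enumerate(pattern) if p == '#']
    if not candidates:
        return None  # pattern is already fully specialized
    mean = sum(bit_errors) / len(bit_errors)
    k = max(candidates, key=lambda idx: abs(bit_errors[idx] - mean))
    return pattern[:k] + '0' + pattern[k + 1:]
```

For instance, `specialize('##0', [0.1, 0.9, 0.2])` yields `'#00'`:
the mean error is 0.4, and the second bit deviates from it most.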

The features that are created by this procedure are linearly
independent. The proof of this statement can be done by
induction on the number $n$ of bits that are present in an input
instance. Consider the base case, in which $n = 1$. The
instance space contains two instances: ‘0’ and ‘1’. There are
two features that can be defined over this instance space:
the most specialized feature, which is associated with the
pattern ‘0’ and only covers the first instance, and the most
general feature, which corresponds to the pattern ‘#’ and
covers both instances. The values of the features for each
instance can be tabulated in the following determinant, with
one row per instance (‘0’, ‘1’) and one column per feature
(‘0’, ‘#’):

$$d_1 = \begin{vmatrix} 1 & 1 \\ 0 & 1 \end{vmatrix}$$

which can be reduced to a unit determinant, by subtracting
the last line from the first one.

Now comes the induction step. Consider the space of the
instances that can be generated by $n$ input bits. These
instances can be viewed as being generated from the
$(n-1)$-bit instances, by adding a ‘0’ or a ‘1’ on the first
position of the vector. Similarly, the features that can be
defined over these instances are generated from $(n-1)$-bit
features by adding a ‘0’ or a ‘#’ on the first position of the
feature. Let $d_{n-1}$ denote the determinant of the
$(n-1)$-bit space input features. The determinant $d_n$ on the
$n$-bit space can be written in block form, with rows for the
instances prefixed by ‘0’ and ‘1’ and columns for the features
prefixed by ‘0’ and ‘#’:

$$d_n = \begin{vmatrix} d_{n-1} & d_{n-1} \\ 0 & d_{n-1} \end{vmatrix}$$

The induction hypothesis is that $d_{n-1}$ can be reduced to
a unit determinant. This can be done by adding and
subtracting lines from each other, as we did in the base case.
If there is a sequence of transformations that achieves this
goal, we can apply it in the upper and lower part of $d_n$. The
resulting determinant will have the form:

$$\begin{vmatrix}
1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
0 & 0 & \cdots & 1 & 0 & 0 & \cdots & 1 \\
0 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & 1
\end{vmatrix}$$

By subtracting the bottom half of the determinant from the
upper half, $d_n$ can also be reduced to a unit determinant.
Thus, the set of all possible features is linearly
independent. This means that any subset of features will be linearly
independent as well. ■
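
The claim can be checked numerically for small $n$. The sketch below
builds the full instance-by-feature matrix and computes its
determinant exactly with rational arithmetic; the helper names
`feature_matrix` and `determinant` are hypothetical, and the check is
only an illustration, not a substitute for the induction proof.

```python
from fractions import Fraction
from itertools import product


def feature_matrix(n):
    """Rows: the 2**n instances; columns: the 2**n patterns over {'0', '#'}."""
    instances = list(product('01', repeat=n))
    patterns = list(product('0#', repeat=n))
    return [[int(all(p == '#' or b == '0' for p, b in zip(pat, inst)))
             for pat in patterns]
            for inst in instances]


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [[Fraction(v) for v in row] for row in matrix]
    det = Fraction(1)
    for col in range(len(m)):
        pivot = next((r for r in range(col, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, len(m)):
            factor = m[r][col] / m[col][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return det
```

With this ordering of instances and patterns, `feature_matrix(1)` is
`[[1, 1], [0, 1]]`, the base-case determinant, and the matrix for
larger $n$ has the block structure used in the induction step, so its
determinant is 1 and the columns (features) are linearly independent.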


The process of training CLEF’s classifier can be viewed
as constructing a sequence of Φ-machines. The previous
proof ensures that at any point between two feature
additions, the classifier that is built is a Φ-machine. A machine
will converge to a set of weights that separates the
training instances, if a separation is possible given the current
set of features. If no linear separation can be found given
the current feature set, by gradually reducing the size of
the corrections, the weights will still settle into a particular
range (Frean, 1990).

In this case, a new feature will be added, and training will
resume with a new machine. In the worst case, the
process will continue until all the $2^n$ features that are possible
have been generated. If the instances are separable when
mapped through a subset of the features, they will also be
separable when the whole set is used. Thus, if a linear
separation of the training instances is possible, the algorithm
is guaranteed to find one. In practice, CLEF also proved to
be quite efficient with respect to the number of features it
generates for a particular instance space.

4  Input  representation 

The non-linear machine described so far requires Boolean
input values. Such an encoding can be generated
automatically for classification tasks. Symbolic variables are
mapped into a 1-of-m encoding, where $m$ is the number of
possible values for each variable. A variable $v$ with
possible values $v_1, \ldots, v_m$ is represented in $m$ bits. Bit $j$
will have the value 1 in an instance representation if and
only if the test $v(x) = v_j$ is true.
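
As an illustration, the 1-of-m encoding of symbolic variables can be
sketched as follows. The function names and the dictionary-based
instance representation are assumptions of this sketch. A missing
value is represented here by `None` and encoded as all zeros, which is
the convention this section describes for missing values.

```python
def one_of_m(value, domain):
    """Encode a symbolic value as m bits, one per possible value.

    A missing value (None) matches no entry and is encoded as all zeros.
    """
    return ''.join('1' if value == v else '0' for v in domain)


def encode_instance(instance, domains):
    """Concatenate the 1-of-m codes of all variables, in the order of `domains`."""
    return ''.join(one_of_m(instance.get(name), domain)
                   for name, domain in domains.items())


# Hypothetical two-variable example.
domains = {'sex': ['male', 'female'], 'ascites': ['yes', 'no']}
code = encode_instance({'sex': 'female'}, domains)
```

Here `code` is `'0100'`: the second bit marks `sex = female`, and the
two `ascites` bits stay at 0 because that value is missing.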

Since ELF only deals with Boolean inputs, some form of
discretization is needed for continuous variables. We have
experimented with two methods for discretizing the
continuous variables. The first method was suggested by Fayyad
and Irani (1993). The basic mechanism is to sort the
instance class labels based on the value of the continuous
variable. The points at which the class label changes are
potential cutpoints for the variable. At each step, the
algorithm looks at the list of possible cutpoints and
determines the information gain for each partition generated by
the cutpoint. A cutpoint is accepted if its information gain
is above a certain threshold, and in this case the algorithm
proceeds recursively to partition the sub-intervals left and
right of the cutpoint. We found this method to be quite
conservative in the number of intervals used in the
discretization, which led to poor performance when used for our
classification algorithm.
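
The recursive procedure described in this paragraph can be sketched
as below. The sketch follows the description given here (accept a
cutpoint when its information gain exceeds a threshold) and does not
reproduce the stopping criterion of the cited method; the fixed
`threshold` argument, the placement of a cutpoint midway between two
neighbouring values, and all function names are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def best_cutpoint(values, labels):
    """Return (gain, cutpoint) for the best class-boundary cut, or None."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best = None
    for k in range(1, len(pairs)):
        (v0, l0), (v1, l1) = pairs[k - 1], pairs[k]
        if l0 == l1 or v0 == v1:
            continue  # only points where the class label changes are candidates
        left = [l for _, l in pairs[:k]]
        right = [l for _, l in pairs[k:]]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - remainder
        if best is None or gain > best[0]:
            best = (gain, (v0 + v1) / 2)
    return best


def discretize(values, labels, threshold):
    """Recursively collect cutpoints whose information gain exceeds `threshold`."""
    found = best_cutpoint(values, labels)
    if found is None or found[0] <= threshold:
        return []
    cut = found[1]
    left = [(v, l) for v, l in zip(values, labels) if v < cut]
    right = [(v, l) for v, l in zip(values, labels) if v >= cut]
    return (discretize([v for v, _ in left], [l for _, l in left], threshold)
            + [cut]
            + discretize([v for v, _ in right], [l for _, l in right], threshold))
```

A high threshold makes the procedure stop early and return few
intervals, which is the conservative behaviour noted in the text.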

The second method was originally proposed by Fulton,
Kasif and Salzberg (1995) and then extended by Elomaa
and Rousu (1996). In this case, the algorithm searches
for the best split with a given maximum number of
intervals. The quality of a partition is evaluated by an impurity
measure, and the efficiency of the search is ensured by a
dynamic programming algorithm. The impurity measure
used for the experiments reported in this paper is
information gain. Based on the intervals determined in this way, the
continuous values for all the instances are transformed into
a 1-of-m encoding, with one bit for each of the $m$ intervals.

The number of bits representing each input variable varies
widely. If the input variables were coded in the same
number of bits, the probability of any input bit having the
value 1 would be equal, assuming that all the input instances
are equiprobable. For variables coded with different numbers
of bits, the probability of a bit corresponding to a
low-arity variable being on is higher than the probability of a bit
being on for a high-arity variable. A simple adjustment is
used to remove this bias: the error attributed to each bit is
normalized with respect to the number of bits used to
encode the variable to which the bit belongs.


To  handle  missing  values,  if  the  value  of  a  variable  is  miss¬ 
ing  in  the  input  then  all  the  bits  corresponding  to  that  vari¬ 
able  are  set  to  0.  This  prevents  the  missing  value  from  hav¬ 
ing  any  role  in  the  classification  process,  since  it  will  not 
interfere  with  the  matching  (all  features  will  match  at  that 
input  variable). 

5  Illustration 

The  Boolean  encoding  of  the  features  allows  an  interpre¬ 
tation  of  the  units  that  form  a  non-linear  classification  ma¬ 
chine.  Feature  interpretation  can  be  generated  automati¬ 
cally,  by  printing  the  negation  of  each  test  for  which  there 
is  a  ‘0’  in  the  feature’s  pattern. 

Table  1  illustrates  the  features  that  have  been  constructed 
for  one  of  the  units  (discriminant  functions)  in  the  hepatitis 
task  from  the  UCI  data  repository  (Murphy  and  Aha,  1994). 
This  is  a  two-class  problem,  thus  the  corresponding  linear 
machine  will  have  two  discriminant  functions,  one  for  each 
class.  However,  due  to  the  training  procedure,  these  dis¬ 
criminant  functions  are  always  trained  with  equal  amounts 
of  error  having  opposite  signs.  In  this  two  class  case,  the 
functions  end  up  having  the  same  features,  with  weights  of 
opposite  sign. 

This table is analogous to a “health test”, which tells
how to compute a score for an instance. For each line
in the table, one would check if the instance satisfies
the test in the right column. If so, the corresponding
weight would be added to the total score. If the total
score is positive, the instance would be considered as
belonging to the “die” class. For example, a patient with
the following characteristics: age=30, ascites=yes,
spiders=no, sex=female, steroid=no, sgot=79.6,
bilirubin=2, protime=80 will be evaluated to a score of
0.013 + 0.008 - 0.005 - 0.004 = 0.012,
and will therefore be classified as belonging to the “die”
class.

Table 1: Unit corresponding to the “die” class in the
hepatitis task

Weight   Feature
-0.019   age ≮ 37.50
 0.013   ascites ≠ no
-0.012   age ≮ 37.50, liver-firm ≠ yes, spiders ≠ no, varices ≠ no
 0.008   intercept term
 0.008   age ≮ 37.50, protime ≮ 44.50
 0.008   age ≮ 37.50, varices ≠ no
 0.007   age ≮ 37.50, spiders ≠ no, varices ≠ no
-0.006   sgot ≮ 80.50, protime ≮ 87.50
-0.005   steroid ≠ yes
-0.004   bilirubin ≮ 1.35
-0.004   protime ≮ 87.50
-0.004   sex ≠ female
 0.003   sex ≠ female, anorexia ≠ yes
-0.002   sex ≠ female, liver-firm ≠ no
 0.001   spiders ≠ no, histology ≠ yes
-0.000   spiders ≠ no
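
The scoring procedure can be reproduced with a short sketch. The
comparison symbols of Table 1 are damaged in this copy, so the
encoding below rests on an assumption: numeric tests are read as "not
less than" (`not<`) and symbolic tests as "not equal" (`!=`). Under
that reading the example patient matches exactly the four weights
added up in the text. Attributes that the patient description omits
are treated as missing and never block a match, following the
convention for missing values in Section 4.

```python
# (weight, tests) pairs transcribed from Table 1 under the assumed reading
# of the operators; a feature matches when all of its tests hold.
TABLE_1 = [
    (-0.019, [("age", "not<", 37.50)]),
    (0.013, [("ascites", "!=", "no")]),
    (-0.012, [("age", "not<", 37.50), ("liver-firm", "!=", "yes"),
              ("spiders", "!=", "no"), ("varices", "!=", "no")]),
    (0.008, []),  # intercept term: matches every instance
    (0.008, [("age", "not<", 37.50), ("protime", "not<", 44.50)]),
    (0.008, [("age", "not<", 37.50), ("varices", "!=", "no")]),
    (0.007, [("age", "not<", 37.50), ("spiders", "!=", "no"),
             ("varices", "!=", "no")]),
    (-0.006, [("sgot", "not<", 80.50), ("protime", "not<", 87.50)]),
    (-0.005, [("steroid", "!=", "yes")]),
    (-0.004, [("bilirubin", "not<", 1.35)]),
    (-0.004, [("protime", "not<", 87.50)]),
    (-0.004, [("sex", "!=", "female")]),
    (0.003, [("sex", "!=", "female"), ("anorexia", "!=", "yes")]),
    (-0.002, [("sex", "!=", "female"), ("liver-firm", "!=", "no")]),
    (0.001, [("spiders", "!=", "no"), ("histology", "!=", "yes")]),
    (-0.000, [("spiders", "!=", "no")]),
]

PATIENT = {"age": 30, "ascites": "yes", "spiders": "no", "sex": "female",
           "steroid": "no", "sgot": 79.6, "bilirubin": 2, "protime": 80}


def holds(patient, attr, op, ref):
    """Evaluate one negated test; a missing attribute never blocks a match."""
    value = patient.get(attr)
    if value is None:
        return True
    return value != ref if op == "!=" else not value < ref


def score(patient, table):
    """Sum the weights of all features whose tests the patient satisfies."""
    return sum(w for w, tests in table if all(holds(patient, *t) for t in tests))
```

`score(PATIENT, TABLE_1)` adds 0.013 (ascites), 0.008 (intercept),
-0.005 (steroid) and -0.004 (bilirubin), giving 0.012 up to
floating-point rounding, a positive score and hence the “die” class.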

6  Analysis 

How does CLEF perform compared with other
classification algorithms? Will it find a separating Φ-machine in
a reasonable amount of time? Will it construct a large
number of features, perhaps producing an
incomprehensible classifier?

In  order  to  answer  these  questions  empirically,  CLEF  and 
C4.5  were  run  on  several  classification  tasks,  mostly  from 
the  UCI  data  repository  (Murphy  and  Aha,  1994).  This 
allows  for  a  comparison  in  terms  of  classification  accuracy, 
and  provides  some  insight  on  the  efficiency  of  CLEF  and 
the  form  of  the  function  it  provides. 

The  salient  difference  between  CLEF  and  decision  tree  in¬ 
ducers  is  that  CLEF  uses  all  the  training  set  to  construct 
its  classifier.  It  should  be  advantageous  to  CLEF  that  it 
solves  one  classification  problem  using  all  the  data,  instead 
of  many  subproblems,  each  using  only  some  of  the  data. 

CLEF was trained by repeatedly sampling at random
$N = 100|X|$ times from the training set (where $|X|$ is the size
of the training set), for a fixed number of epochs. Training
can stop early, if the instances in the training set are
perfectly separated. For C4.5, the default settings were used
(Quinlan, 1993), both with and without pruning. The
reason for including the results without pruning as well is that
CLEF does not currently use any mechanism for avoiding
overfitting. Therefore, using C4.5 without pruning offers
some insight into the comparative quality of the learning
algorithm itself, though we would like to devise a pruning
mechanism for CLEF.

Table 2: Accuracy results

Task            C4.5           C4.5p          CLEF
audio-no-id     75.7 ± 9.6     77.8 ± 6.6     79.1 ± 9.1
balance-scale   78.3 ± 4.1     77.5 ± 3.2     92.5 ± 4.0
breast-cancer   66.2 ± 6.9     75.5 ± 3.9     70.3 ± 7.1
bupa            64.6 ± 5.3     64.6 ± 5.6     68.7 ± 5.0
Cleveland       46.8 ± 4.1     46.8 ± 5.4     48.7 ± 8.4
hepatitis       76.9 ± 4.9     77.5 ± 5.7     81.9 ± 5.2
iris            94.4 ± 7.6     94.4 ± 7.6     94.4 ± 7.1
led24           61.0 ± 9.0     62.4 ± 9.4     61.9 ± 11.1
lymphography    77.3 ± 12.4    78.0 ± 11.9    80.7 ± 6.3
monks-2         44.5 ± 9.3     65.9 ± 0.0     92.3 ± 4.8
mplex-6         57.1 ± 20.2    57.1 ± 19.2    91.4 ± 14.6
promoter        80.9 ± 14.3    77.3 ± 14.2    87.3 ± 6.0
soybean         90.3 ± 2.8     92.2 ± 2.4     91.9 ± 3.1
Switzerland     32.3 ± 9.6     33.1 ± 7.7     35.4 ± 14.7
tictactoe       66.3 ± 2.0     68.1 ± 2.3     78.4 ± 2.8
va              28.1 ± 12.7    26.7 ± 10.0    32.9 ± 7.2
votes           95.7 ± 3.7     96.6 ± 3.3     94.3 ± 3.1
waveform        69.7 ± 10.4    70.0 ± 10.7    73.9 ± 9.1
wine            93.3 ± 6.0     93.3 ± 6.0     94.2 ± 8.3
zoo             92.7 ± 6.8     91.8 ± 6.4     96.4 ± 4.5
Mean            69.6           71.4           77.3

Table 3: Duncan Multiple Range Test

C4.5    C4.5p   CLEF
69.6    71.4    77.3

Table 2 shows the accuracy results of the two algorithms,
in terms of the mean and standard deviation for each task.
All values are computed from a ten-fold stratified
cross-validation, with CLEF and C4.5 using the same partitions
for each task. As shown in the table, CLEF constructs more
accurate classifiers than C4.5 without pruning on 19 of the
20 tasks considered. The classifiers are also more accurate
than those constructed by C4.5 with pruning on 15 out of
the 20 datasets considered. By doing one-way ANOVA,
the difference between CLEF and C4.5 with no pruning
is significant at the 0.05 level. The difference with C4.5
with pruning is not statistically significant. These results
are confirmed also by the Duncan Multiple Range Test (as
shown in Table 3). There is a statistical difference between
CLEF and C4.5 without pruning, but there is no statistical
difference between CLEF and C4.5 with pruning.


Table 4: Characteristics of the classifier produced

Task            CPU CLEF          Size CLEF       Match
audio-no-id      218.2 ± 42.2      88.0 ± 2.8     77.3 ± 0.8
balance-scale     59.9 ± 37.8      39.0 ± 2.3     66.6 ± 1.9
breast-cancer    191.2 ± 8.7       47.1 ± 1.8     47.7 ± 1.6
bupa             245.4 ± 35.8      49.2 ± 2.4     64.9 ± 4.6
Cleveland        496.0 ± 46.2     117.4 ± 6.1     72.1 ± 2.6
hepatitis         58.8 ± 18.2      17.4 ± 1.0     50.4 ± 4.9
iris              15.4 ± 12.8      19.0 ± 8.2     74.2 ± 6.9
led24             36.9 ± 9.4       76.8 ± 4.9     54.2 ± 1.3
lymphography      39.7 ± 15.3      28.6 ± 2.8     66.8 ± 2.3
monks-2          926.2 ± 515.1     59.3 ± 8.3     27.6 ± 2.2
mplex-6            0.8 ± 0.8       11.5 ± 1.8     37.6 ± 2.2
promoter          26.9 ± 6.2        7.8 ± 0.4     64.8 ± 1.6
soybean         1684.0 ± 30.5      96.9 ± 2.5     71.5 ± 0.8
Switzerland      182.8 ± 16.9      72.2 ± 4.5     85.9 ± 4.1
tictactoe       5792.5 ± 236.0    241.6 ± 14.0    29.6 ± 1.1
va               351.2 ± 28.8     123.0 ± 7.0     70.5 ± 1.9
votes             22.3 ± 1.3       14.3 ± 1.4     46.3 ± 3.4
waveform         346.6 ± 105.3     43.6 ± 4.9     79.5 ± 1.9
wine              14.2 ± 4.6       16.5 ± 3.6     81.3 ± 4.0
zoo                3.1 ± 0.7       19.0 ± 1.2     75.8 ± 3.0

CPU  and  memory  costs  are  indicated  in  Table  4.  Compu¬ 
tationally,  the  CLEF  algorithm  is  more  costly  than  C4.5. 
Memory  costs  are  not  large.  The  table  presents  the  mem¬ 
ory  requirements  of  the  resulting  classifier  in  terms  of  the 
total  number  of  features  present  in  the  machine.  CLEF  typ¬ 
ically  constructs  a  small  set  of  features,  each  of  which  con¬ 
sists  of  a  simple  bit  pattern  and  a  single  weight.  In  order 
to  measure  the  degree  of  overlap  of  the  features  that  form 
a  classifier,  the  average  percentage  of  features  matching  an 
instance  was  evaluated.  The  “match  train”  column  shows 
this  measure  for  the  instances  in  the  training  set.  The  values 
show  that  there  is  a  high  degree  of  overlap  in  the  features 
that  are  constructed. 
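
The overlap measure described above, the average percentage of
features matching an instance, can be computed as in the following
sketch. Patterns and instances are written as strings, and the
function name `match_percentage` is a hypothetical choice; the
three-feature classifier in the example is made up for illustration
and is not one of the classifiers from the experiments.

```python
def match_percentage(patterns, instances):
    """Average, over the instances, of the percentage of features that match."""
    def covers(pattern, instance):
        # A '0' in the pattern requires a '0' in the instance; '#' matches anything.
        return all(p == '#' or b == '0' for p, b in zip(pattern, instance))

    per_instance = [100.0 * sum(covers(p, x) for p in patterns) / len(patterns)
                    for x in instances]
    return sum(per_instance) / len(per_instance)


# Illustrative classifier with three features over two input bits.
overlap = match_percentage(['##', '#0', '00'], ['00', '01', '10', '11'])
```

In this example the four instances match 3, 1, 2 and 1 of the three
features respectively, so `overlap` is 175/3, roughly 58.3 percent.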

7  Related  work 

A  variety  of  constructive  methods  have  been  devised  for 
classification  problems.  A  large  class  of  algorithms  con¬ 
struct  networks  of  thresholded  logic  units,  by  adding 
boundaries that correct for misclassified examples (Parekh
et al., 1997). These algorithms also separate consistently
labelled  examples.  The  experimental  results  that  have  been 
published  regarding  these  algorithms  are  limited,  so  they 
do  not  provide  a  good  basis  for  comparison  with  CLEF. 

Several algorithms that automatically construct a neural
network configuration have also been used in
classification tasks. Fahlman and Lebiere’s (1990) cascade
correlation method constructs a new hidden unit (feature) in
order to minimize the residual error and freezes its defining
weights. The original input variables and the newly
constructed unit become the input variables for the next layer.
The algorithm has produced good results when applied to
classification tasks. Wynne-Jones (1992) presents an
approach called node splitting that detects when the
hyperplane of a hidden unit is oscillating, indicating that the unit
is being pushed in conflicting directions in feature space.
Such a unit is split into two units, and the weights are set
so that the units are moved apart from each other along an
advantageous axis. A meiosis network (Hanson, 1990) is a
feed-forward network in which the variance of each weight
is maintained. For a hidden unit (feature) that has one or
more weights of high variance, the unit is split into two.
The input weights that define the feature, and the output
weight for the linear combination are altered so that the
two units are moved away from their means in opposite
directions.

Support  Vector  Machines  (Vapnik,  1995)  can  also  be 
viewed  as  constructing  features  automatically,  but  the  form 
of  the  features  that  are  constructed  needs  to  be  defined  a 
priori.  More  work  would  be  needed  to  explore  the  relation¬ 
ship  between  CLEF  and  support  vector  machines. 

8  Summary 

CLEF is a classification algorithm that constructs a
Φ-machine to fit the multiclass data. By using the ELF
function approximator, non-linear features are constructed as
needed. The sequence of feature sets produced by ELF has
the effect that CLEF produces a sequence of Φ-machine
classifiers. This sequence will ultimately produce a
Φ-machine that separates the instances, whether or not they
are linearly separable in the input variables. By contrast to
decision trees, which recursively partition the training
instances, CLEF constructs a classifier using the whole
training set. This approach provides an advantage in terms of
the accuracy of the resulting classifiers.
Abstract

We  analyze  critically  the  use  of  classifica¬ 
tion  accuracy  to  compare  classifiers  on  natu¬ 
ral  data  sets,  providing  a  thorough  investiga¬ 
tion  using  ROC  analysis,  standard  machine 
learning  algorithms,  and  standard  bench¬ 
mark  data  sets.  The  results  raise  serious  con¬ 
cerns  about  the  use  of  accuracy  for  comparing 
classifiers  and  draw  into  question  the  conclu¬ 
sions  that  can  be  drawn  from  such  studies. 

In  the  course  of  the  presentation,  we  describe 
and  demonstrate  what  we  believe  to  be  the 
proper  use  of  ROC  analysis  for  comparative 
studies  in  machine  learning  research.  We  ar¬ 
gue  that  this  methodology  is  preferable  both 
for  making  practical  choices  and  for  drawing 
scientific  conclusions. 

1  INTRODUCTION 

Substantial  research  has  been  devoted  to  the  devel¬ 
opment  and  analysis  of  algorithms  for  building  clas¬ 
sifiers,  and  a  necessary  part  of  this  research  involves 
comparing  induction  algorithms.  A  common  method¬ 
ology  for  such  evaluations  is  to  perform  statistical 
comparisons  of  the  accuracies  of  learned  classifiers 
on  suites  of  benchmark  data  sets.  Our  purpose  is 
not  to  question  the  statistical  tests  (Dietterich,  1998; 
Salzberg,  1997),  but  to  question  the  use  of  accuracy 
estimation  itself.  We  believe  that  since  this  is  one  of 
the  primary  scientific  methodologies  of  our  field,  it  is 
important  that  we  (as  a  scientific  community)  cast  a 
critical  eye  upon  it. 

The two most reasonable justifications for comparing
accuracies on natural data sets require empirical
verification. We argue that a particular form of ROC
analysis is the proper methodology to provide such
verification. We then provide a thorough analysis of
classifier performance using standard machine learning
algorithms and standard benchmark data sets. The
results raise serious concerns about the use of accuracy,
both for practical comparisons and for drawing
scientific conclusions, even when predictive performance is
the only concern.

The  contribution  of  this  paper  is  two-fold.  We  analyze 
critically  a  common  assumption  of  machine  learning 
research,  provide  insights  into  its  applicability,  and  dis¬ 
cuss  the  implications.  In  the  process,  we  describe  what 
we  believe  to  be  a  superior  methodology  for  the  eval¬ 
uation  of  induction  algorithms  on  natural  data  sets. 
Although ROC analysis certainly is not new, for machine
learning research it should be applied in a principled
manner geared to the specific conclusions machine
learning  researchers  would  like  to  draw.  We  hope  that 
this  work  makes  significant  progress  toward  that  goal. 

2  JUSTIFYING  ACCURACY 
COMPARISONS 

We  consider  induction  problems  for  which  the  intent  in 
applying  machine  learning  algorithms  is  to  build  from 
the  existing  data  a  model  (a  classifier)  that  will  be 
used  to  classify  previously  unseen  examples.  We  limit 
ourselves  to  predictive  performance — which  is  clearly 
the  intent  of  most  accuracy-based  machine  learning 
studies — and  do  not  consider  issues  such  as  compre¬ 
hensibility  and  computational  performance. 

We  assume  that  the  true  distribution  of  examples  to 
which  the  classifier  will  be  applied  is  not  known  in 
advance.  To  make  an  informed  choice,  performance 
must  be  estimated  using  the  data  available.  The 
different  methodologies  for  arriving  at  these  estima¬ 
tions  have  been  described  elsewhere  (Kohavi,  1995; 
Dietterich, 1998). By far, the most commonly used
performance  metric  is  classification  accuracy. 

Why  should  we  care  about  comparisons  of  accuracies 
on  benchmark  data  sets?  Theoretically,  over  the  uni¬ 
verse  of  induction  algorithms  no  algorithm  will  be  su¬ 
perior  on  all  possible  induction  problems  (Wolpert, 
1994;  Schaffer,  1994).  The  tacit  reason  for  comparing 
classifiers  on  natural  data  sets  is  that  these  data  sets 
represent  problems  that  systems  might  face  in  the  real 
world,  and  that  superior  performance  on  these  bench¬ 
marks  may  translate  to  superior  performance  on  other 
real-world  tasks.  To  this  end,  the  field  has  amassed 
an  admirable  collection  of  data  sets  from  a  wide  vari¬ 
ety  of  classifier  applications  (Merz  and  Murphy,  1998). 
Countless  research  results  have  been  published  based 
on  comparisons  of  classifier  accuracy  over  these  bench¬ 
mark  data  sets.  We  argue  that  comparing  accuracies 
on  our  benchmark  data  sets  says  little,  if  anything, 
about  classifier  performance  on  real-world  tasks. 

Accuracy  maximization  is  not  an  appropriate  goal  for 
many  of  the  real-world  tasks  from  which  our  natural 
data  sets  were  taken.  Classification  accuracy  assumes 
equal  misclassification  costs  (for  false  positive  and  false 
negative errors). This assumption is problematic, because
for most real-world problems one type of classification
error is much more expensive than another.
This  fact  is  well  documented,  primarily  in  other  fields 
(statistics,  medical  diagnosis,  pattern  recognition  and 
decision  theory).  As  an  example,  consider  machine 
learning  for  fraud  detection,  where  the  cost  of  missing 
a  case  of  fraud  is  quite  different  from  the  cost  of  a  false 
alarm  (Fawcett  and  Provost,  1997). 
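
A small numerical sketch makes the point concrete. The two classifiers, the 10% positive-class prior, and the 50:1 cost ratio below are hypothetical values chosen only for illustration; they are not taken from any of the data sets discussed here.

```python
def accuracy(fp_rate, tp_rate, p_pos):
    """Accuracy of a classifier with the given error rates and positive-class prior."""
    return p_pos * tp_rate + (1.0 - p_pos) * (1.0 - fp_rate)


def expected_cost(fp_rate, tp_rate, p_pos, cost_fn, cost_fp):
    """Expected misclassification cost per example."""
    missed = p_pos * (1.0 - tp_rate) * cost_fn        # false negatives
    false_alarms = (1.0 - p_pos) * fp_rate * cost_fp  # false positives
    return missed + false_alarms


# Hypothetical classifiers, each given as (fp_rate, tp_rate).
cautious = (0.02, 0.50)
eager = (0.20, 0.95)

# Hypothetical target conditions: 10% positives, a miss costs 50x a false alarm.
p_pos, cost_fn, cost_fp = 0.1, 50.0, 1.0

for name, (fp, tp) in [("cautious", cautious), ("eager", eager)]:
    print(name,
          "accuracy=%.3f" % accuracy(fp, tp, p_pos),
          "cost=%.3f" % expected_cost(fp, tp, p_pos, cost_fn, cost_fp))
```

Under these assumed conditions the classifier with the higher accuracy is also the one with the higher expected cost.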

Accuracy  maximization  also  assumes  that  the  class 
distribution  (class  priors)  is  known  for  the  target  envi¬ 
ronment.  Unfortunately,  for  our  benchmark  data  sets, 
we  often  do  not  know  whether  the  existing  distribu¬ 
tion  is  the  natural  distribution,  or  whether  it  has  been 
stratified.  The  iris  data  set  has  exactly  50  instances  of 
each  class.  The  splice  junction  data  set  (DNA)  has 
50%  donor  sites,  25%  acceptor  sites  and  25%  non¬ 
boundary  sites,  even  though  the  natural  class  distri¬ 
bution  is  very  skewed:  no  more  than  6%  of  DNA  ac¬ 
tually  codes  for  human  genes  (Saitta  and  Neri,  1998). 
Without  knowledge  of  the  target  class  distribution  we 
cannot  even  claim  that  we  are  indeed  maximizing  ac¬ 
curacy  for  the  problem  from  which  the  data  set  was 
drawn. 

If  accuracy  maximization  is  not  appropriate,  why 
would  we  use  accuracy  estimates  to  compare  induc¬ 
tion  algorithms  on  these  data  sets?  Here  are  what  we 


believe  to  be  the  two  best  candidate  justifications. 

1.  The  classifier  with  the  highest  accuracy  may  very 
well  be  the  classifier  that  minimizes  cost,  particu¬ 
larly  when  the  classifier’s  tradeoff  between  true 
positive  predictions  and  false  positives  can  be 
tuned.  Consider  a  learned  model  that  produces 
probability  estimates;  these  can  be  combined  with 
prior  probabilities  and  cost  estimates  for  decision- 
analytic  classifications.  If  the  model  has  high  clas¬ 
sification  accuracy  because  it  produces  very  good 
probability  estimates,  it  will  also  have  low  cost  for 
any  target  scenario. 

2.  The  induction  algorithm  that  produces  the 
highest  accuracy  classifiers  may  also  produce 
minimum-cost  classifiers  by  training  it  differently. 
For  example,  Breiman  et  al.  (1984)  suggest  that 
altering  the  class  distribution  will  be  effective 
for  building  cost-sensitive  decision  trees  (see  also 
other  work  on  cost-sensitive  classification  (Tur¬ 
ney,  1996)). 
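
The decision-analytic use of probability estimates mentioned in the first justification can be sketched as follows. This is a minimal illustration rather than code from the study: the function names are hypothetical, and the prior correction assumes that the class-conditional distributions are the same in the training and target environments.

```python
def adjust_prior(p_pos, train_prior, target_prior):
    """Rescale a positive-class probability estimated under train_prior
    so that it reflects target_prior instead (two-class case)."""
    pos = p_pos * target_prior / train_prior
    neg = (1.0 - p_pos) * (1.0 - target_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


def predict_positive(p_pos, cost_fp, cost_fn):
    """Choose the class with the lower expected cost.

    Predicting negative risks a false negative with probability p_pos;
    predicting positive risks a false positive with probability 1 - p_pos.
    """
    return p_pos * cost_fn > (1.0 - p_pos) * cost_fp


# A model trained on balanced data says 0.5; positives are rare in the target.
p = adjust_prior(0.5, train_prior=0.5, target_prior=0.1)
print(p, predict_positive(p, cost_fp=1.0, cost_fn=1.0))
print(p, predict_positive(p, cost_fp=1.0, cost_fn=20.0))
```

With equal costs the rule reduces to thresholding the adjusted probability at 0.5; unequal costs move the threshold to cost_fp / (cost_fp + cost_fn).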

To  criticize  the  practice  of  comparing  machine  learn¬ 
ing  algorithms  based  on  accuracy,  it  is  not  sufficient 
merely  to  point  out  that  accuracy  is  not  the  metric  by 
which  real-world  performance  will  be  measured.  In¬ 
stead,  it  is  necessary  to  analyze  whether  these  candi¬ 
date  justifications  are  well  founded. 

3  ARE  THESE  JUSTIFICATIONS 
REASONABLE? 

We  first  discuss  a  commonly  cited  special  case  of  the 
second  justification,  arguing  that  it  makes  too  many 
untenable  assumptions.  We  then  present  the  results 
of  an  empirical  study  that  leads  us  to  conclude  that 
these  justifications  are  questionable  at  best. 

3.1  CAN  WE  DEFINE  AWAY  THE 
PROBLEM? 

In  principle,  for  a  two-class  problem  one  can  repropor¬ 
tion  (“stratify”)  the  classes  based  on  the  target  costs 
and  class  distribution.  Once  this  has  been  done,  max¬ 
imizing  accuracy  on  the  transformed  data  corresponds 
to minimizing costs on the target data (Breiman et al.,
1984). Unfortunately, this strategy is impracticable for
conducting  empirical  research  based  on  our  benchmark 
data  sets.  First,  the  transformation  is  valid  only  for 
two-class  problems.  Whether  it  can  be  approximated 
effectively  for  multiclass  problems  is  an  open  question. 
Second, we do not know appropriate costs for these
data  sets  and,  as  noted  by  many  applied  researchers 
(Bradley,  1997;  Catlett,  1995;  Provost  and  Fawcett, 
1997),  assigning  these  costs  precisely  is  virtually  im¬ 
possible.  Third,  as  described  above,  generally  we  do 
not  know  whether  the  class  distribution  in  a  natural 
data  set  is  the  “true”  target  class  distribution. 
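
For reference, the two-class transformation described at the start of this section can be sketched as below. The sketch assumes that the target prior and both error costs are known exactly, which is precisely what the three objections above call into question; all numeric values and function names are hypothetical.

```python
def expected_cost(fp_rate, tp_rate, p_pos, cost_fn, cost_fp):
    """Expected misclassification cost per example on the target data."""
    return p_pos * (1.0 - tp_rate) * cost_fn + (1.0 - p_pos) * fp_rate * cost_fp


def error_rate(fp_rate, tp_rate, p_pos):
    """Plain error rate (1 - accuracy) under positive-class prior p_pos."""
    return p_pos * (1.0 - tp_rate) + (1.0 - p_pos) * fp_rate


def stratified_prior(p_pos, cost_fn, cost_fp):
    """Positive-class proportion of the reproportioned data: each class is
    weighted by its target prior times the cost of misclassifying it."""
    w_pos = p_pos * cost_fn
    w_neg = (1.0 - p_pos) * cost_fp
    return w_pos / (w_pos + w_neg)


p_pos, cost_fn, cost_fp = 0.1, 50.0, 1.0   # hypothetical target conditions
q_pos = stratified_prior(p_pos, cost_fn, cost_fp)
scale = p_pos * cost_fn + (1.0 - p_pos) * cost_fp

fp_rate, tp_rate = 0.02, 0.50              # hypothetical classifier
err = error_rate(fp_rate, tp_rate, q_pos)
cost = expected_cost(fp_rate, tp_rate, p_pos, cost_fn, cost_fp)
print(err * scale, cost)  # equal up to rounding
```

Because the error rate on the reproportioned data is the target cost divided by a constant, ranking classifiers by accuracy on the transformed data is the same as ranking them by cost on the target data.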

Because  of  these  uncertainties  we  cannot  claim  to  be 
able  to  transform  these  cost-minimization  problems 
into  accuracy-maximization  problems.  Moreover,  in 
many  cases  specifying  target  conditions  is  not  just 
virtually  impossible,  it  is  actually  impossible.  Of¬ 
ten  in  real-world  domains  there  are  no  “true”  tar¬ 
get  costs  and  class  distribution.  These  change  from 
time  to  time,  place  to  place,  and  situation  to  situation 
(Fawcett  and  Provost,  1997). 

Therefore  the  ability  to  transform  cost  minimization 
into  accuracy  maximization  does  not,  by  itself,  justify 
limiting  our  comparisons  to  classification  accuracy  on 
the  given  class  distribution.  However,  it  may  be  that 
comparisons  based  on  classification  accuracy  are  use¬ 
ful  because  they  are  indicative  of  a  broader  notion  of 
“better”  performance. 

3.2  ROC  ANALYSIS  AND  DOMINATING 
MODELS 

We  now  investigate  whether  an  algorithm  that  gen¬ 
erates  high-accuracy  classifiers  is  generally  better  be¬ 
cause  it  also  produces  low-cost  classifiers  for  the  target 
cost  scenario.  Without  target  cost  and  class  distribu¬ 
tion  information,  in  order  to  conclude  that  the  clas¬ 
sifier  with  higher  accuracy  is  the  better  classifier,  one 
must  show  that  it  performs  better  for  any  reasonable 
assumptions.  We  limit  our  investigation  to  two-class 
problems  because  the  analysis  is  straightforward. 

The  evaluation  framework  we  choose  is  Receiver  Op¬ 
erating  Characteristic  (ROC)  analysis  (Egan,  1975; 
Swets  and  Pickett,  1982;  Swets,  1988),  a  classic 
methodology  from  signal  detection  theory  that  is  now 
common  in  medical  diagnosis  (Beck  and  Schultz,  1986) 
and  has  recently  begun  to  be  used  more  generally  in 
AI  (Bradley,  1997;  Provost  and  Fawcett,  1997). 

We  briefly  review  some  of  the  basics  of  ROC  analy¬ 
sis.  ROC  space  denotes  the  coordinate  system  used 
for  visualizing  classifier  performance.  In  ROC  space, 
typically  the  true  positive  rate,  TP,  is  plotted  on  the  Y 
axis  and  the  false  positive  rate,  FP,  is  plotted  on  the  X 
axis.  Each  classifier  is  represented  by  the  point  in  ROC 
space corresponding to its (FP, TP) pair. For models
that produce a continuous output (e.g., an estimate of


the  posterior  probability  of  an  instance’s  class  mem¬ 
bership),  these  statistics  vary  together  as  a  threshold 
on  the  output  is  varied  between  its  extremes,  with 
each  threshold  value  defining  a  classifier.  The  result¬ 
ing  curve,  called  the  ROC  curve,  illustrates  the  error 
tradeoffs  available  with  a  given  model.  ROC  curves 
describe  the  predictive  behavior  of  a  classifier  inde¬ 
pendent  of  class  distributions  or  error  costs,  so  they 
decouple  classification  performance  from  these  factors. 
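
The threshold sweep that traces out an ROC curve can be sketched as follows. This is an illustrative implementation with hypothetical scores and labels, not the code used to produce the figures.

```python
def roc_points(scores, labels):
    """Return the (FP, TP) points obtained by lowering a threshold over scores.

    scores: higher means more likely positive.
    labels: truthy for positive instances, falsy for negative ones.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    tp = fp = 0
    for i, (score, is_pos) in enumerate(ranked):
        if is_pos:
            tp += 1
        else:
            fp += 1
        # Tied scores share one threshold, so emit a point only after a tie group.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
for point in roc_points(scores, labels):
    print(point)
```

Each emitted point corresponds to one classifier, namely the model combined with one particular threshold.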

For  our  purposes,  a  crucial  notion  is  whether  one 
model  dominates  in  ROC  space,  meaning  that  all  other 
ROC curves are beneath it or equal to it. A dominating
model (e.g., model NB in Figure 1a) is at least as
good  as  all  other  models  for  all  possible  cost  and  class 
distributions.  Therefore,  if  a  dominating  model  exists, 
it  can  be  considered  to  be  the  “best”  model  in  terms 
of  predictive  performance.  If  a  dominating  model  does 
not exist (as in Figure 1b), then none of the models
represented  is  best  under  all  target  scenarios;  in  such 
cases,  there  exist  scenarios  for  which  the  model  that 
maximizes  accuracy  (or  any  other  single-number  met¬ 
ric)  does  not  have  minimum  cost. 
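
A simple way to test dominance between two piecewise-linear ROC curves is sketched below. It compares the curves on a regular grid of FP values, so it is an approximate check rather than an exact one, and it ignores statistical significance; the curves are hypothetical.

```python
def tp_at(curve, fp):
    """TP of a piecewise-linear ROC curve at a given FP.

    curve: list of (FP, TP) points sorted by FP, running from (0, 0) to (1, 1).
    Where several points share the same FP, the largest TP is used.
    """
    best = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fp <= x1:
            if x1 == x0:
                y = max(y0, y1)
            else:
                y = y0 + (y1 - y0) * (fp - x0) / (x1 - x0)
            best = max(best, y)
    return best


def dominates(curve_a, curve_b, grid_size=101, tol=1e-12):
    """True if curve_a is at or above curve_b at every grid point."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    return all(tp_at(curve_a, fp) >= tp_at(curve_b, fp) - tol for fp in grid)


curve_a = [(0.0, 0.0), (0.1, 0.6), (0.5, 0.9), (1.0, 1.0)]
curve_b = [(0.0, 0.0), (0.3, 0.5), (1.0, 1.0)]
curve_c = [(0.0, 0.0), (0.05, 0.7), (1.0, 1.0)]

print(dominates(curve_a, curve_b))                               # a dominates b
print(dominates(curve_a, curve_c), dominates(curve_c, curve_a))  # curves cross
```

When the curves cross, as curve_a and curve_c do here, neither model dominates and the better choice depends on the target costs and class distribution.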

Figure  1  shows  test-set  ROC  curves  on  two  of  the  UCI 
domains from the study described below. Note the
“bumpiness” of the ROC curves in Figure 1b (these
were  two  of  the  largest  domains  with  the  least  bumpy 
ROC  curves).  This  bumpiness  is  typical  of  induction 
studies  using  ROC  curves  generated  from  a  hold-out 
test  set.  As  with  accuracy  estimates  based  on  a  sin¬ 
gle  hold-out  set,  these  ROC  curves  may  be  misleading 
because  we  cannot  tell  how  much  of  the  observed  vari¬ 
ation  is  due  to  the  particular  training/test  partition. 
Thus  it  is  difficult  to  draw  strong  conclusions  about  the 
expected  behavior  of  the  learned  models.  We  would 
like  to  conduct  ROC  analysis  using  cross-validation. 

Bradley  (1997)  produced  ROC  curves  from  10-fold 
cross  validation,  but  they  are  similarly  bumpy. 
Bradley  generated  the  curves  using  a  technique  known 
as  pooling.  In  pooling,  the  ith  points  making  up  each 
raw  ROC  curve  are  averaged.  Unfortunately,  as  dis¬ 
cussed  by  Swets  and  Pickett  (1982),  pooling  assumes 
that  the  ith  points  from  all  the  curves  are  actually  esti¬ 
mating  the  same  point  in  ROC  space,  which  is  doubtful 
given Bradley’s method of generating curves.^1 For our
study  it  is  important  to  have  a  good  approximation  of 
the  expected  ROC  curve. 

^1 Bradley acknowledges this fact, and it is not germane
to his study. However, it is problematic for us.

Figure 1: Raw (un-averaged) ROC curves from two UCI database domains: (a) Adult, (b) Satimage

We generate results from 10-fold cross-validation using
a different methodology, called averaging. Rather than
using the averaging procedure recommended by Swets
and Pickett, which assumes normal-fitted ROC curves
in a binormal ROC space, we average the ROC curves
in the following manner. For k-fold cross-validation,
the ROC curve from each of the k folds is treated
as a function, R_i, such that TP = R_i(FP). This
is done with linear interpolations between points in
ROC space^2 (if there are multiple points with the
same FP, the one with the maximum TP is chosen).
The averaged ROC curve is the function R(FP) =
mean(R_i(FP)). To plot averaged ROC curves we
sample from R at 100 points regularly spaced along
the FP-axis. We compute confidence intervals of the
mean  of  TP  using  the  common  assumption  of  a  bino¬ 
mial  distribution. 
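
The averaging procedure just described can be sketched as follows. This is a minimal illustration under the stated rules (linear interpolation, maximum TP at repeated FP values, 100 regularly spaced samples); the function names and the two example curves are hypothetical, and the confidence intervals are omitted.

```python
def tp_at(curve, fp):
    """Evaluate one fold's ROC curve as a function TP = R_i(FP).

    curve: list of (FP, TP) points sorted by FP, running from (0, 0) to (1, 1).
    Linear interpolation is used between points; where several points share
    the same FP, the one with the maximum TP is chosen.
    """
    best = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        if x0 <= fp <= x1:
            if x1 == x0:
                y = max(y0, y1)
            else:
                y = y0 + (y1 - y0) * (fp - x0) / (x1 - x0)
            best = max(best, y)
    return best


def average_roc(fold_curves, n_samples=100):
    """Average the per-fold curves at n_samples regularly spaced FP values."""
    grid = [i / (n_samples - 1) for i in range(n_samples)]
    k = len(fold_curves)
    return [(fp, sum(tp_at(curve, fp) for curve in fold_curves) / k)
            for fp in grid]


# Two hypothetical fold curves: a chance-level fold and a perfect fold.
fold_curves = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
]
averaged = average_roc(fold_curves)
print(averaged[0], averaged[-1])
```

Averaging along the TP axis at fixed FP avoids the assumption made by pooling, namely that the ith points of different curves estimate the same point in ROC space.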

3.3 DO STANDARD METHODS
PRODUCE DOMINATING MODELS?

We can now state precisely a basic hypothesis to be investigated:
Our standard learning algorithms produce
dominating models for our standard benchmark data
sets. If this hypothesis is true (generally), we might
conclude that the algorithm with higher accuracy is
generally better, regardless of target costs or priors.^3

^2 Note that classification performance anywhere along a
line  segment  connecting  two  ROC  points  can  be  achieved 
by  randomly  selecting  classifications  (weighted  by  the  in¬ 
terpolation  proportion)  from  the  classifiers  defining  the 
endpoints. 

^3 However, even this conclusion has problems. Accuracy
comparisons may select a non-dominating classifier because
it is indistinguishable at the point of comparison, yet it
may be much worse elsewhere.

If the hypothesis is not true, then such a conclusion
will have to rely on a different justification. We now
provide an experimental study of this hypothesis, designed
as follows.

From the UCI repository we chose ten datasets that
contained at least 250 instances, but for which the accuracy
for decision trees was less than 95% (because
the ROC curves are difficult to read at very high accuracies).
For each domain, we induced classifiers for
the minority class (for Road we chose the class Grass).
We selected several inducers from MLC++ (Kohavi et
al., 1997): a decision tree learner (MC4), Naive Bayes
with discretization (NB), k-nearest neighbor for several
k values (IBk), and Bagged-MC4 (Breiman, 1996).
MC4 is similar to C4.5 (Quinlan, 1993); probabilistic
predictions are made by using a Laplace correction at
the leaves. NB discretizes the data based on entropy
minimization (Dougherty et al., 1995) and then builds
the Naive-Bayes model (Domingos and Pazzani, 1997).
IBk votes the closest k neighbors; each neighbor votes
with a weight equal to one over its distance from the
test instance.

The averaged ROC curves are shown in Figures 2
and 3. For only one (Vehicle) of these ten domains
was there an absolute dominator. In general, very few
of the 100 runs we performed (10 data sets, 10 cross-validation
folds each) had dominating classifiers. Some
cases are very close, for example Adult and Waveform-21.
In other cases a curve that dominates in one area
of ROC space is dominated in another. Therefore, we
can refute the hypothesis that our algorithms produce
(statistically significantly) dominating classifiers.

Figure 2: Smoothed ROC curves from UCI database domains: (a) Vehicle, (b) Waveform-21, (c) DNA, (d) Adult

This  draws  into  question  claims  of  “algorithm  A  is  bet¬ 
ter  than  algorithm  B”  based  on  accuracy  comparison. 
In  order  to  draw  such  a  conclusion  in  the  absence  of 
target  costs  and  class  distributions,  the  ROC  curve  for 
algorithm  A  would  have  to  be  a  significant  dominator 
of  algorithm  B.  This  has  obvious  implications  for  ma¬ 
chine  learning  research. 

In  practical  situations,  often  a  weaker  claim  is  suffi¬ 
cient:  Algorithm  A  is  a  good  choice  because  it  is  at 
least  as  good  as  Algorithm  B  (i.e.,  their  accuracies 
are  not  significantly  different).  It  is  clear  that  this 
type  of  conclusion  also  is  not  justified.  In  many  do¬ 
mains,  curves  that  are  statistically  indistinguishable 


from  dominators  in  one  area  of  the  space  are  signifi¬ 
cantly  dominated  in  another.  Moreover,  in  practical 
situations  typically  comparisons  are  not  made  with 
the  wealth  of  classifiers  we  are  considering.  More  of¬ 
ten  only  a  few  classifiers  are  compared.  Considering 
general  pairwise  comparisons  of  algorithms,  there  are 
many  cases  where  each  model  in  a  pair  is  clearly  much 
better  than  the  other  in  different  regions  of  ROC  space. 
This  clearly  draws  into  question  the  use  of  single  num¬ 
ber  metrics  for  practical  algorithm  comparison,  unless 
these  metrics  are  based  on  precise  target  cost  and  class 
distribution  information. 


Figure 3: Smoothed ROC curves from UCI database domains, cont’d: (a) Breast cancer, (b) CRX, (c) German, (d) Pima, (e) RoadGrass
3.4 CAN STANDARD METHODS BE
COERCED TO YIELD DOMINATING
ROC CURVES?


The  second  justification  for  using  accuracy  to  compare 
algorithms  is  subtly  different  from  the  first.  Specifi¬ 
cally,  it  allows  for  the  possibility  of  coercing  algorithms 
to  produce  different  behaviors  under  different  scenar¬ 
ios  (such  as  in  cost-sensitive  learning).  If  this  can  be 
done  well,  accuracy  comparisons  are  justified  by  argu¬ 
ing  that  for  a  given  domain,  the  algorithm  with  higher 
accuracy  will  also  be  the  algorithm  with  lower  cost  for 
all  reasonable  costs  and  class  distributions. 

Confirming  or  refuting  this  justification  completely  is 
beyond  the  scope  of  this  paper,  because  how  best  to  co¬ 
erce  algorithms  for  different  environmental  conditions 
is  an  open  question.  Even  the  straightforward  method 
of  stratifying  samples  has  not  been  evaluated  satisfac¬ 
torily.  We  propose  that  the  ROC  framework  outlined 
so  far,  with  a  minor  modification,  can  be  used  to  eval¬ 
uate  this  question  as  well. 

For  algorithms  that  may  produce  different  models  un¬ 
der  different  cost  and  class  distributions,  the  ROC 
methodology  as  stated  above  is  not  quite  adequate. 
We  must  be  able  to  evaluate  the  performance  of  the 
algorithm,  not  an  individual  model.  However,  one  can 
characterize  an  algorithm’s  performance  for  ROC  anal¬ 
ysis  by  producing  a  composite  curve  for  a  set  of  gen¬ 
erated  models.  This  can  be  done  using  pooling,  or  by 
using  the  convex  hull  of  the  ROC  curves  produced  by 
the  set  of  models,  as  described  in  detail  by  Provost 
and  Fawcett  (1997;  1998). 
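
The convex-hull construction of a composite curve can be sketched as follows. This is an illustrative implementation of a standard upper-hull scan, not the code of Provost and Fawcett; the input points are hypothetical (FP, TP) pairs from models an algorithm might produce under different training conditions.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of a set of (FP, TP) points.

    The trivial classifiers (0, 0) and (1, 1) are always included, since
    calling everything negative or everything positive is always possible.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # joining its predecessor to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


model_points = [(0.1, 0.6), (0.3, 0.5), (0.5, 0.9), (0.7, 0.8)]
print(roc_convex_hull(model_points))
```

Points that fall strictly below the hull, such as (0.3, 0.5) here, are never optimal for any combination of costs and class distribution, so the hull serves as the composite curve for the set of models.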

We  can  now  form  a  hypothesis  for  our  second  potential 
justification: Our standard learning algorithms produce
dominating ROC curves for our standard benchmark
data sets. Confirming this hypothesis would be
an important step in justifying the common practice of
ignoring target costs and class distributions in classifier
comparisons  on  natural  data.  Unfortunately,  we  know 
of  no  confirming  evidence. 

On  the  other  hand,  there  is  disconfirming  evidence. 
First,  consider  the  results  presented  above.  Naive 
Bayes  is  robust  with  respect  to  changes  in  costs — it 
will  produce  the  same  ROC  curve  regardless  of  the 
target  costs  and  class  distribution.  Furthermore,  it 
has  been  shown  that  decision  trees  are  surprisingly  ro¬ 
bust  if  the  probability  estimates  are  generated  with 
the Laplace estimate (Bradford et al., 1998). If this
result  holds  generally,  the  results  in  the  previous  sec¬ 
tion  would  disconfirm  the  present  hypothesis  as  well. 


Second,  Bradley’s  (1997)  results  provide  disconfirming 
evidence.  Specifically,  he  studied  six  real-world  med¬ 
ical  data  sets  (four  from  the  UCI  repository  and  two 
from  other  sources).  Bradley  plotted  the  ROC  curves 
of  six  classifier  learning  algorithms,  consisting  of  two 
neural  nets,  two  decision  trees  and  two  statistical  tech¬ 
niques.  Bradley  uses  composite  ROC  curves  formed 
by  training  models  differently  for  different  cost  distri¬ 
butions.  We  have  previously  criticized  the  design  of 
his  study  for  the  purpose  of  answering  our  question. 
However,  if  the  results  can  be  replicated  under  the 
current  methodology,  they  would  make  a  strong  state¬ 
ment.  Not  one  of  the  six  data  sets  had  a  dominating 
classifier.  This  implies  that  for  each  domain  there  exist 
disjoint  sets  of  conditions  for  which  different  induction 
algorithms  are  preferable. 

4  RECOMMENDATIONS  AND 
LIMITATIONS 

When  designing  comparative  studies,  researchers 
should  be  clear  about  the  conclusions  they  want  to 
be  able  to  draw  from  the  results.  We  have  argued 
that  comparisons  of  algorithms  based  on  accuracy  are 
unsatisfactory  when  there  is  no  dominating  classifier. 
However,  presenting  the  case  against  the  use  of  accu¬ 
racy  is  only  one  of  our  goals.  We  also  want  to  show 
how  precise  comparisons  still  can  be  made,  even  when 
the  target  cost  and  class  distributions  are  not  known. 

If  there  is  no  dominator,  conclusions  must  be  quali¬ 
fied.  No  single  number  metric  can  be  used  to  make 
very  strong  conclusions  without  domain-specific  infor¬ 
mation.  However,  it  is  possible  to  look  at  ranges  of 
costs  and  class  distributions  for  which  each  classifier 
dominates.  The  problems  of  cost-sensitive  classifica¬ 
tion  and  learning  with  skewed  class  distributions  can 
be  analyzed  precisely. 

Even  without  knowledge  of  target  conditions,  a  pre¬ 
cise,  concise,  robust  specification  of  classifier  perfor¬ 
mance  can  be  made.  As  described  in  detail  by  Provost 
and  Fawcett  (1997),  the  slopes  of  the  lines  tangent  to 
the  ROC  convex  hull  determine  the  ranges  of  costs 
and  class  distributions  for  which  particular  classifiers 
minimize  cost.  For  specific  target  conditions,  the  cor¬ 
responding  slope  is  the  cost  ratio  times  the  reciprocal 
of  the  class  ratio.  For  our  ten  domains,  the  optimal 
classifiers  for  different  target  conditions  are  given  in 
Table  1.  For  example,  in  the  Road  domain  (see  Fig¬ 
ure  3  and  Table  1),  Naive  Bayes  is  the  best  classifier 
for  any  target  conditions  corresponding  to  a  slope  less 
than 0.38, and Bagged-MC4 is best for slopes greater
than 0.38. They perform equally well at 0.38. We
admit that this is not as elegant as a single-number
comparison, but we believe it to be much more useful,
both for research and in practice.

Table 1: Locally dominating classifiers for ten UCI domains

Domain          Slope range       Dominator
--------------  ----------------  -----------
Adult           [0, 7.72]         NB
                (7.72, 21.6]      Bagged-MC4
                (21.6, ∞)         NB
Breast cancer   [0, 0.37]         [illegible]
                (0.37, 0.5]       IB3
                (0.5, 1.34]       IB5
                (1.34, 2.38]      IB3
                (2.38, ∞)         Bagged-MC4
CRX             [0, 0.03]         Bagged-MC4
                (0.03, 0.06]      NB
                (0.06, 2.06]      Bagged-MC4
                (2.06, ∞)         NB
German          [0, 0.21]         NB
                (0.21, 0.47]      Bagged-MC4
                (0.47, 3.08]      NB
                (3.08, ∞)         IB5
Road (Grass)    [0, 0.38]         NB
                (0.38, ∞)         Bagged-MC4
Pima            [0, 0.06]         NB
                (0.06, 0.11]      Bagged-MC4
                (0.11, 0.30]      NB
                (0.30, 0.82]      Bagged-MC4
                (0.82, 1.13]      NB
                (1.13, 4.79]      Bagged-MC4
                (4.79, ∞)         NB
Satimage        [0, 0.05]         [illegible]
                (0.05, 0.22]      Bagged-MC4
                (0.22, 2.60]      IB5
                (2.60, 3.11]      IB3
                (3.11, 7.54]      IB5
                (7.54, 31.14]     IB3
                (31.14, ∞)        Bagged-MC4
Waveform-21     [0, 0.25]         NB
                (0.25, 4.51]      Bagged-MC4
                (4.51, 6.12]      IB5
                (6.12, ∞)         Bagged-MC4
DNA             [0, 1.06]         NB
                (1.06, ∞)         Bagged-MC4
Vehicle         [0, ∞)            Bagged-MC4

In  summary,  if  a  dominating  classifier  does  not  exist 
and  cost  and  class  distribution  information  is  unavail¬ 
able,  no  strong  statement  about  classifier  superiority 
can  be  made.  However,  one  might  be  able  to  make 
precise  statements  of  superiority  for  specific  regions  of 
ROC  space.  For  example,  if  all  you  know  is  that  few 
false  positive  errors  can  be  tolerated,  you  may  be  able 
to  find  a  particular  algorithm  that  is  superior  at  the 
“far  left”  edge  of  ROC  space. 
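
The use of Table 1 for specific target conditions can be sketched as follows. The slope formula is written under the assumption that "cost ratio" means the cost of a false positive over the cost of a false negative and "class ratio" means positives over negatives, following the iso-performance lines of Provost and Fawcett (1997). Only the Road, DNA, and Vehicle rows are transcribed, and the priors and costs are hypothetical.

```python
import math

# (upper end of slope range, dominator), transcribed from Table 1.
TABLE_1 = {
    "Road (Grass)": [(0.38, "NB"), (math.inf, "Bagged-MC4")],
    "DNA": [(1.06, "NB"), (math.inf, "Bagged-MC4")],
    "Vehicle": [(math.inf, "Bagged-MC4")],
}


def target_slope(p_pos, cost_fp, cost_fn):
    """Slope for given target conditions: cost ratio times the reciprocal
    of the class ratio (orientation of both ratios is an assumption)."""
    return (cost_fp / cost_fn) * ((1.0 - p_pos) / p_pos)


def best_classifier(domain, slope):
    """Look up the locally dominating classifier for a slope."""
    for upper, name in TABLE_1[domain]:
        if slope <= upper:
            return name
    raise ValueError("slope must be a non-negative number")


# Balanced classes and equal costs give a slope of 1.
slope = target_slope(p_pos=0.5, cost_fp=1.0, cost_fn=1.0)
print(slope, best_classifier("Road (Grass)", slope), best_classifier("DNA", slope))

# Making a miss five times as costly as a false alarm lowers the slope.
slope = target_slope(p_pos=0.5, cost_fp=1.0, cost_fn=5.0)
print(slope, best_classifier("Road (Grass)", slope))
```

At a boundary value such as 0.38 the lookup returns the classifier of the lower range, which is harmless because the two classifiers perform equally well there.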

We  limited  our  investigation  to  two  classes.  This  does 
not  affect  our  conclusions  since  our  results  are  nega¬ 
tive.  However,  since  we  are  also  recommending  an  an¬ 
alytical  framework,  we  note  that  extending  our  work 
to  multiple  dimensions  is  an  interesting  open  problem. 

Finally,  we  are  not  completely  satisfied  with  our 
method  of  generating  confidence  intervals.  The 
present  intervals  are  appropriate  for  the  Neyman- 
Pearson  observer  (Egan,  1975),  which  wants  to  max¬ 
imize  TP  for  a  given  FP.  However,  their  appropriate¬ 
ness  is  questionable  for  evaluating  minimum  expected 
cost,  for  which  a  given  set  of  costs  contours  ROC  space 
with  lines  of  a  particular  slope.  Although  this  is  an 
area  of  future  work,  it  is  not  a  fundamental  drawback 
to  the  methodology. 


5  CONCLUSIONS 


We  have  offered  for  debate  the  justification  for  the  use 
of  accuracy  estimation  as  the  primary  metric  for  com¬ 
paring  algorithms  on  our  benchmark  data  sets.  We 
have  elucidated  what  we  believe  to  be  the  top  can¬ 
didates  for  such  a  justification,  and  have  shown  that 
either  they  are  not  realistic  because  we  cannot  specify 
cost  and  class  distributions  precisely,  or  they  are  not 
supported  by  experimental  evidence. 

We  draw  two  conclusions  from  this  work.  First,  the 
justifications  for  using  accuracy  to  compare  classifiers 
are  questionable  at  best.  Second,  we  have  described 
what  we  believe  to  be  the  proper  use  of  ROC  analysis 
as  applied  to  comparative  studies  in  machine  learning 
research.  ROC  analysis  is  not  as  simple  as  compar¬ 
ing  with  a  single-number  metric.  However,  we  believe 
that  the  additional  power  it  delivers  is  well  worth  the 
effort.  In  certain  situations,  ROC  analysis  allows  very 
strong,  general  conclusions  to  be  made — both  positive 
and  negative.  In  situations  where  strong,  general  con¬ 
clusions  cannot  be  made,  ROC  analysis  allows  very 
precise  analysis  to  be  conducted. 

Although  ROC  analysis  is  not  new,  in  machine  learn¬ 
ing  research  it  has  not  been  applied  in  a  principled 
manner,  geared  to  the  specific  conclusions  machine 
learning  researchers  would  like  to  draw.  We  hope  that 
this  work  makes  significant  progress  toward  that  goal. 
Abstract

While  there  has  been  a  growing  interest  in  the  problem  of 
learning  Bayesian  networks  from  data,  no  technique  exists 
for  learning  or  revising  Bayesian  networks  with  hidden  vari¬ 
ables  (i.e.  variables  not  represented  in  the  data),  that  has 
been  shown  to  be  efficient,  effective,  and  scalable  through 
evaluation  on  real  data.  The  few  techniques  that  exist  for 
revising  such  networks  perform  a  blind  search  through  a 
large  space  of  revisions,  and  are  therefore  computationally 
expensive.  This  paper  presents  Banner,  a  technique  for 
using  data  to  revise  a  given  Bayesian  network  with  noisy-or 
and noisy-and nodes, to improve its classification accuracy.
The  initial  network  can  be  derived  directly  from  a  logical 
theory  expressed  as  propositional  rules.  Banner  can  revise 
networks  with  hidden  variables,  and  add  hidden  variables 
when necessary. Unlike previous approaches, Banner employs
mechanisms similar to logical theory refinement techniques
for using the data to focus the search for effective
modifications.  Experiments  on  real-world  problems  in  the 
domain  of  molecular  biology  demonstrate  that  Banner  can 
effectively  revise  fairly  large  networks  to  significantly  im¬ 
prove  their  accuracies. 

1  Introduction 

Bayesian networks have become the most popular approach
to uncertain reasoning due to their precise
probabilistic semantics as well as their success in practical
applications. In an attempt to automate their construction,
induction of Bayes nets has become a topic
of  increasing  interest.  A  number  of  learning  methods 
have  been  developed  for  the  case  where  all  relevant 
variables  are  observable  (Heckerman,  1995).  Param¬ 
eter  learning  methods  for  networks  with  hidden  vari¬ 
ables  (variables  not  represented  in  the  data)  have  also 
been developed (Russell, Binder, Koller, & Kanazawa,
1995;  Thiesson,  1995).  However,  learning  both  the 
structure  and  the  parameters  of  a  Bayesian  network 
with  hidden  variables  remains  a  problem.  Many  of 
the existing methods can be adapted to discover hidden
variables, but only by conducting extensive search
that is impractical for most problems. A recent development
is MS-EM (Friedman, 1997), which learns the
structure  of  a  network  with  hidden  variables;  however, 
it  requires  specifying  the  number  of  hidden  variables 
and  has  not  been  tested  on  real  data. 

As  demonstrated  by  theory  refinement  research  on 
rule-bases, using empirical data to revise an initial imperfect
knowledge base can significantly improve performance
over induction from scratch (Opitz & Shavlik,
1993; Ourston & Mooney, 1994; Towell & Shavlik,
1994; Mahoney & Mooney, 1994; Brunk & Pazzani,
1995). A few techniques have been developed
for revising Bayesian networks (Lam & Bacchus, 1994;
Buntine,  1991);  however,  they  do  not  handle  hidden 
variables. Many existing Bayes-net induction methods
could be adapted to revision, but only by examining
all possible individual modifications. By contrast,
rule-revision  systems  use  classification  errors  on  the 
training  data  to  propose  specific  modifications  rather 
than  blindly  examining  all  possible  options.  The  result 
is  an  efficient,  directed  revision  process. 

We have developed a technique, Banner, for refining
Bayesian networks with hidden variables that, like
rule-refinement algorithms, uses the data to focus the
search  for  effective  modifications.  Banner’s  goal  is  to 
improve  the  accuracy  of  an  initial  network  for  a  spe¬ 
cific  inference  task  by  modifying  both  its  parameters 
and  structure,  including  adding  new  hidden  variables. 
Although Bayesian networks can simultaneously support
many types of inference, training directly for
the desired classification task results in better performance
(Friedman & Goldszmidt, 1996). Since general
Bayesian networks are impractical for many large
problems  because  the  number  of  parameters  grows  ex¬ 
ponentially  in  the  fan-in  of  a  node,  we  focus  on  net¬ 
works  with  noisy-or  and  noisy-and  nodes,  specialized 
models that require only a linear number of parameters
(Pearl, 1988; Pradhan, Provan, Middleton, &
Henrion,  1994).  Since  these  models  are  close  to  logical 
functions,  they  also  allow  a  rule-base  to  be  used  as  an 
initial  theory  by  mapping  the  rules  to  a  network  in  the 
obvious  way.  Existing  results  show  that  the  accuracy 
of  rule  bases  can  be  dramatically  improved  by  mapping 
them  to  a  representation  that  provides  numerical  sum¬ 
ming  of  evidence  (Towell  &  Shavlik,  1994;  Mahoney 
&  Mooney,  1994).  However,  the  neural  networks  or 
certainty-factor  rules  employed  in  these  results  do  not 
provide  an  interpretable  knowledge  base  with  parame¬ 
ters  that  have  a  precise  semantics.  An  important  goal 
of  theory  refinement  is  to  provide  interpretable  knowl¬ 
edge,  and  we  believe  Bayes  nets  are  preferable  in  this 
regard. 

Experimental  evaluation  of  Bayes  net  learning  has 
largely  been  conducted  on  artificial  data  and  not  ade¬ 
quately  compared  to  other  methods  on  real  problems 
(exceptions  include  Provan  and  Singh  (1994),  Fried¬ 
man  and  Goldszmidt  (1996)),  and  we  know  of  no 
Bayes-net  results  on  revising  real  knowledge  bases  to 
fit  actual  data.  We  have  evaluated  Banner  on  several 
realistic  problems  used  to  test  other  theory  refinement 
systems,  obtaining  performance  competitive  with  the 
current  best  results  while  maintaining  the  advantages 
of  a  Bayes-net  representation.  The  remainder  of  the 
paper  presents  an  overview  of  Banner’s  learning  al¬ 
gorithm  and  the  promising  results  of  this  evaluation. 

2  Refinement  Algorithm 

As is usual in theory refinement, the goal is to minimally
modify the initial theory to make it consistent
with the available training data. Taking the standard
approach, Banner employs one procedure to revise
the  parameters  of  a  network  and  another  to  revise  the 
structure.  First,  the  parameters  are  revised  to  im¬ 
prove  classification  accuracy.  If  the  resulting  network 
does  not  adequately  fit  the  training  data,  the  struc¬ 
ture  of  the  network  is  modified  and  the  parameters 
are retrained. This process repeats until it is determined
that additional training results in over-fitting.*
In this paper, we focus on structure revision. Our current
implementation includes two parameter revision
algorithms, Banner-Pr (Ramachandran & Mooney,
1996) and C-APN (based on Russell et al., 1995),
which use different forms of gradient descent.
Ramachandran (1998) presents further details.

*The parameter revision component uses 10-fold internal
cross-validation on the training set to determine when
to stop (Mitchell, 1997).


Structure  revision  exploits  the  idea  that  networks  with 
noisy-or/and  nodes  are  similar  to  logical  theories  and 
therefore techniques used to revise rule bases are useful.
These methods attribute classification errors on
particular  examples  to  specific  portions  of  the  theory 
and  directly  construct  revisions  to  handle  the  mis- 
classified  cases.  Most  logical  refinement  systems  use 
abduction  to  diagnose  faults  (Mooney,  1997).  Since 
Bayesian  networks  place  no  restrictions  on  the  direc¬ 
tion  of  inference,  abduction  can  be  performed  using  the 
standard  inference  algorithms.  In  addition,  leak  nodes 
(Pradhan  et  al.,  1994)  provide  a  way  to  model  the  in¬ 
completeness  and  incorrectness  of  a  Bayesian  network 
with  noisy-or/and  nodes.  A  leak  node  is  a  source  in 
the  graph  added  as  an  extra  input  to  a  node  in  order 
to  represent  a  possible  unknown  cause.  Banner  diag¬ 
noses  faults  in  a  network  by  temporarily  instrumenting 
each  node  with  leak  nodes  that  indicate  potential  re¬ 
vision  points.  It  then  uses  training  data  to  select  a 
small  set  of  revision  points  and  construct  appropriate 
refinements. 
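To make the role of a leak node concrete, the following is a minimal sketch of a noisy-or node with an extra leak input. It assumes the standard textbook noisy-or form (each true parent independently causes the child with probability equal to its link weight); the function names and the example weights are illustrative and are not taken from Banner's implementation. The leak node is treated as one more parent that is true with a prior probability and has its own link weight, so its marginal contribution is the product of the two.

```python
from math import prod
from typing import Sequence


def noisy_or(parents: Sequence[bool], weights: Sequence[float],
             leak: float = 0.0) -> float:
    """P(child = true) under the standard noisy-or model.

    Each true parent independently turns the child on with probability
    equal to its link weight; `leak` is the probability that an
    unmodelled cause turns the child on.
    """
    # The child stays off only if the leak and every active parent all fail.
    p_all_fail = (1.0 - leak) * prod(
        1.0 - w for on, w in zip(parents, weights) if on
    )
    return 1.0 - p_all_fail


def effective_leak(leak_prior: float, leak_weight: float) -> float:
    """Marginal effect of an explicit leak parent (assumed formulation):
    the leak node is true with `leak_prior` and then fires with `leak_weight`."""
    return leak_prior * leak_weight


# Two of three parents are true; hypothetical link weights.
p_no_leak = noisy_or([True, False, True], [0.8, 0.9, 0.5])
# A leak node with a very low prior and weight barely changes the semantics.
p_low_leak = noisy_or([True, False, True], [0.8, 0.9, 0.5],
                      leak=effective_leak(0.01, 0.01))
```

With the leak prior and weight both set very low, the instrumented node behaves almost exactly like the original one, which is why temporary instrumentation does not disturb the network until the leak parameters are re-estimated.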

2.1  Selecting  Revision  Points 

The  procedure  for  instrumenting  a  network  with  leak 
nodes  is  best  illustrated  with  an  example,  such  as  that 
shown  in  Figure  1  (A-G  are  the  original  nodes).  Each 
noisy-or/and node has an added parent called a node-leak
node.  In  order  to  avoid  significantly  altering  the  se¬ 
mantics  of  the  net,  the  prior  of  the  leak  node  and  its 
link  parameter  are  initially  set  very  low.  However, 
when  the  algorithm  detects  misclassifications,  it  re- 
estimates  the  prior  probabilities  by  training  a  copy  of 
the network augmented with node-leak nodes using
the parameter revision module. All of the original
noisy-or (noisy-and) nodes also have their parents
routed through an intervening noisy-and (noisy-or)
node. The intervening nodes themselves have attached
leak nodes called link-leak nodes. To avoid altering
the semantics, the weights on the links are set
to  simulate  logical  functions  and  the  prior  probability 
of  the  link-leak  node  is  set  to  the  weight  on  the  origi¬ 
nal  link.  The  leak  nodes  effectively  represent  possible 
faults  in  the  theory,  with  node-leak  nodes  representing 
the  need  for  new  inputs  to  a  node,  and  link-leak  nodes 
representing  the  need  for  new  intervening  hidden  vari¬ 
ables  between  two  nodes. 

Figure 1: Augmenting a network with leak nodes

Once the network is properly instrumented, Banner
performs abduction on each misclassified example to
generate a set of repairs that could correct the example.
This involves instantiating both the evidence
and the target variables in the augmented network to
their observed values and inferring the beliefs associated
with the leak nodes using standard Bayesian
inference. For each misclassified example, it collects a
set of leak nodes whose beliefs deviate from their prior
probability by more than 10%. Such leak nodes are
said  to  cover  the  example,  and  indicate  potential  revi¬ 
sion  points  in  the  theory.  When  the  belief  in  the  truth 
of a leak node decreases from its prior, it is called an
inhibitor  for  that  example;  if  it  increases,  it  is  called 
an  enabler.  Each  leak  node  covering  an  example  is 
associated  with  the  degree  to  which  its  belief  devi¬ 
ated  from  its  prior,  indicating  the  extent  to  which  it  is 
blamed  for  the  misclassification.  Once  leak  nodes  are 
collected for all misclassified examples, Banner uses
a  greedy  set  covering  algorithm  (where  the  contribu¬ 
tion  of  each  leak  node  is  weighted  by  its  degree)  to 
generate  a  small  set  of  leak  nodes  that  cover  all  of  the 
misclassified  examples.  While  Banner  uses  only  mis¬ 
classified  examples  to  generate  a  set  of  revision  points, 
it  performs  abduction  on  all  the  examples,  generating 
leak  nodes  that  are  enablers  or  inhibitors  for  each  ex¬ 
ample.  This  information  is  used  during  the  generation 
of  appropriate  revisions. 
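The selection of revision points described above can be sketched as follows. This is an illustrative reading, not Banner's actual code: it assumes the 10% threshold is an absolute change in belief, and it assumes the greedy covering step repeatedly picks the leak node with the largest total degree over the examples that are still uncovered. The function names and leak-node labels are hypothetical.

```python
from typing import Dict, List, Tuple

THRESHOLD = 0.10  # assumption: absolute deviation of belief from the prior


def covering_leaks(priors: Dict[str, float],
                   beliefs: Dict[str, float],
                   threshold: float = THRESHOLD) -> Dict[str, Tuple[str, float]]:
    """For one example, return {leak: (role, degree)} for every leak node
    whose inferred belief moved away from its prior by more than threshold."""
    covered = {}
    for leak, prior in priors.items():
        delta = beliefs[leak] - prior
        if abs(delta) > threshold:
            role = "enabler" if delta > 0 else "inhibitor"
            covered[leak] = (role, abs(delta))
    return covered


def greedy_cover(coverage: Dict[int, Dict[str, float]]) -> List[str]:
    """Greedy weighted set cover.

    coverage maps each misclassified example id to {leak: degree}.
    Examples that no leak node covers are skipped, since no revision
    point can account for them.
    """
    uncovered = {ex for ex, leaks in coverage.items() if leaks}
    chosen: List[str] = []
    while uncovered:
        scores: Dict[str, float] = {}
        for ex in uncovered:
            for leak, degree in coverage[ex].items():
                scores[leak] = scores.get(leak, 0.0) + degree
        # Ties are broken alphabetically so the result is deterministic.
        best = max(sorted(scores), key=scores.get)
        chosen.append(best)
        uncovered = {ex for ex in uncovered if best not in coverage[ex]}
    return chosen


roles = covering_leaks({"G-leak": 0.01, "E-A-leak": 0.9},
                       {"G-leak": 0.60, "E-A-leak": 0.5})
points = greedy_cover({1: {"G-leak": 0.6, "E-A-leak": 0.2},
                       2: {"G-leak": 0.5},
                       3: {"E-A-leak": 0.4}})
```

In this toy run G-leak is chosen first because it accounts for the most blame over examples 1 and 2, and E-A-leak is then needed to cover example 3.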

2.2  Revision  Operators 

For  each  revision  point  in  the  covering  set,  Banner 
implements  one  of  the  following  modifications  to  help 
correct the misclassified examples covered by the corresponding
leak node: 1) add a new parent, 2) add a
new hidden node, 3) delete a link. The first operator
is invoked when a revision point is a node-leak node,
in  which  case  it  adds  a  new  parent  to  the  appropriate 
node  in  the  original  network.  In  the  example,  if  G- 
leak  is  a  selected  revision  point,  then  a  new  parent  is 
added  to  G.  The  heuristic  for  selecting  the  new  parent 
is  discussed  below. 

If a revision point is a link-leak node, Banner modifies
the corresponding link. One option is to introduce
a new hidden variable with an additional parent and
the  same  type  as  the  corresponding  intervening  node. 
In  the  example,  if  E-A-leak  is  the  revision  point,  a  new 
noisy-or  node  is  added  between  E  and  A  (see  Figure  2). 
The  rationale  for  such  a  revision  is  that  the  previous 
step  of  abduction  with  the  augmented  network  indi¬ 
cated  that  such  a  structure  would  better  explain  the 
misclassified  data. 

However,  in  some  cases,  the  problematic  link  is  simply 
deleted. For example, if E-A-leak is an enabler for several
examples but never an inhibitor, the link may be
deleted to correct the misclassified examples without
affecting other examples since the link is effectively an
always-true input to a noisy-and which therefore has
no  effect.  A  dual  argument  can  be  made  for  noisy-or 
nodes.  A  link  is  also  deleted  if,  when  a  hidden  node 
is  added,  the  chosen  parent  has  the  same  effect  as  link 
deletion.  For  example,  if  the  negation  of  A  is  chosen 
as  the  new  parent  of  E-A,  the  link  between  E  and  A 
is  deleted. 

Figure 2: Revision operator: Adding a hidden node

Given: An initial network, and a set of training data.
Output: A revised network.

1. Initialize the parameters of the network either randomly or based on some prior knowledge.

2. Repeat steps (a)-(e) until there is no improvement in training accuracy over a pre-specified number of
   consecutive cycles.

   (a) Set train-net = initial network.
   (b) Set leak-net = train-net augmented with node-leak nodes.
   (c) Train network train-net to revise parameters.
   (d) If the previous step indicates overfitting, or all examples are correctly classified, return train-net.
   (e) Else:
       i.   Train network leak-net to estimate prior probabilities of the node-leak nodes.
       ii.  Set augmented-net = train-net augmented with node-leak and link-leak nodes.
       iii. Copy priors of leak nodes from leak-net to augmented-net.
       iv.  For each example,
            A. Instantiate input and target nodes of augmented-net with values from the example.
            B. Infer beliefs of all the nodes in augmented-net.
            C. Collect all enabled and inhibited node-leak and link-leak nodes.
       v.   Set revision-points = small set of node-leak and link-leak nodes that cover all the misclassified
            examples (computed using greedy set covering).
       vi.  For each revision point in revision-points, revise train-net at the revision point using one of the
            revision operators.

Figure 3: Outline of the Refinement Algorithm

New parents are selected based on the examples for
which the chosen leak-node is an enabler or inhibitor.
The  new  parent  needs  to  be  true  for  the  examples 
it  must  enable  and  false  for  the  ones  it  must  in¬ 
hibit.  Banner  uses  a  standard  information  gain  met¬ 
ric  (Quinlan,  1990)  to  choose  a  parent  that  best  dis¬ 
criminates  between  these  two  sets  of  examples.  This 
metric, commonly used in inductive learning algorithms
(Mahoney & Mooney, 1994; Quinlan, 1990,
1986),  estimates  the  information  gained  about  a  target 
function  value  from  knowing  the  value  of  an  attribute. 
Two versions of this metric are commonly used.
The version used by Quinlan (1990) to learn propositional
Horn-clause theories is designed to pick a feature
that best discriminates between sets of examples,
with the additional constraint that the feature have
specific values (e.g. true or false) for each set of examples.
This version is most appropriate for our theory
refinement  algorithm  because  we  need  to  select  a  new 
parent  that  discriminates  between  the  examples  that 
need  an  enabling  influence,  and  the  examples  that  need 
an  inhibitory  influence,  with  the  additional  constraint 
that  the  new  parent  be  true  for  the  former  set  of  ex¬ 
amples  and  false  for  the  latter  set  of  examples. 

Suppose that we are given a set of examples, $S$, of
size $N$, of which $N^+$ are positive examples of a given
class $C$, and $N^-$ are negative examples of $C$. Also assume
that all the features in the examples are boolean-valued.
For any given feature $F$, let $N_f$ be the number
of examples for which $F$ is true; of these, let $N_f^+$ be
the number of examples which are positive examples
of $C$, and $N_f^-$ be the number of examples which are
negative examples of $C$. Then, the reduction due to
$F$ in the total number of bits required to encode the
positive members of $C$ is given by

$$\mathrm{Gain}(C, F) = N_f^+ \cdot \bigl(I(S) - I(N_f)\bigr),$$

where $I(S) = -\log_2(N^+/N)$ is the number of bits
required to encode a positive member of class $C$, and
$I(N_f) = -\log_2(N_f^+/N_f)$ is the number of bits required
to encode the positive members of class $C$, given
that $F$ is true. The higher the value of this function,
the greater the correlation between the examples
for which $F$ is true and the positive examples of $C$.
Note that this computation can be easily generalized
to hidden variables and variables with missing values.
Information gain for such nodes can be obtained by
weighting the frequency measures $N_f^+$ and $N_f^-$ by the
degree of belief associated with these nodes for each
example.

So  far,  we  have  described  this  metric  with  a  view  to 
selecting  an  enabling  parent.  The  same  metric  is  used 
to select an inhibitory parent by defining $N_f$ to be the
number  of  examples  for  which  F  is  false.  Every  other 
term  in  the  computation  of  the  metric  is  defined  as 
before.  In  general,  all  nodes  in  the  network  and  their 
negations  are  potential  candidates;  however,  to  avoid 
redundancy  and  the  introduction  of  loops,  the  existing 
parents  and  descendents  of  the  recipient  of  the  new 
parent  are  excluded.  Figure  3  shows  a  summary  of  the 
overall  algorithm. 
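As a worked illustration of the parent-selection metric, the sketch below computes the gain for boolean candidate features and picks the best enabling parent. It assumes the FOIL-style form Gain(C, F) = N_f^+ (I(S) - I(N_f)) with I(S) = -log2(N^+/N) and I(N_f) = -log2(N_f^+/N_f); the function names and the feature labels are hypothetical, and the counts may be replaced by belief-weighted sums for hidden variables.

```python
import math
from typing import Dict, List


def gain(n_pos: float, n_total: float, nf_pos: float, nf_total: float) -> float:
    """Gain(C, F) = N_f^+ * (I(S) - I(N_f)), in bits.

    Returns 0.0 when F is true for no positive example, since such a
    feature cannot help encode the positives.
    """
    if n_pos <= 0 or nf_pos <= 0:
        return 0.0
    i_s = -math.log2(n_pos / n_total)      # bits per positive before the split
    i_nf = -math.log2(nf_pos / nf_total)   # bits per positive given F is true
    return nf_pos * (i_s - i_nf)


def best_enabling_parent(features: Dict[str, List[bool]],
                         needs_enabling: List[bool]) -> str:
    """Pick the feature that is true for the examples needing an enabling
    influence and false for those needing an inhibitory one."""
    n_total = len(needs_enabling)
    n_pos = sum(needs_enabling)
    scores = {}
    for name, values in features.items():
        nf_total = sum(values)
        nf_pos = sum(1 for v, y in zip(values, needs_enabling) if v and y)
        scores[name] = gain(n_pos, n_total, nf_pos, nf_total)
    # Alphabetical tie-breaking keeps the choice deterministic.
    return max(sorted(scores), key=scores.get)


candidates = {
    "P-35=T": [True, True, False, False],   # true exactly where enabling is needed
    "P-10=A": [True, False, True, False],   # uninformative
}
chosen = best_enabling_parent(candidates, [True, True, False, False])
```

Here the first candidate receives a gain of 2 bits and the second a gain of 0, so the first is selected. For an inhibitory parent, the same code can be applied to the negated feature values.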

3  Experimental  Evaluation 

We  conducted  experiments  on  realistic  problems  and 
data to demonstrate that Banner is effective at revising
networks to improve their classification accuracy.
We also compared its performance to naive Bayes,
which learns a simple Bayes net that includes all features
and assumes conditional independence,* to
Kbann (Towell & Shavlik, 1994), a neural-network
refinement method, to Rapture (Mahoney & Mooney,
1994), a certainty-factor refinement method, and to
two standard inductive algorithms: C4.5 (Quinlan,
1993) for decision trees and Backprop (McClelland
& Rumelhart, 1988) for neural networks. In order
to  study  the  contribution  of  Banner’s  components, 
we  also  performed  ablation  studies,  where  we  disabled 
parts  of  the  algorithm  and  compared  performance  to 
the full system: Banner-Ind is an inductive version
which does not utilize an initial theory but starts with
a default network with input and output variables but
no links, and Banner-Pr (parameter revision)
uses an initial theory but does not perform structure
revision.  Finally,  we  specifically  evaluated  structure 
revision  by  attempting  to  fix  an  artificially  corrupted 
initial  theory. 

We  present  results  on  two  molecular  biology  problems 
employed  in  previous  refinement  experiments:  recog¬ 
nizing  promoters  and  splice-junctions  in  DNA  strands 
(Towell  &  Shavlik,  1994).  These  problems  include  im¬ 
perfect,  expert-provided  theories  represented  as  propo¬ 
sitional  rules.  These  theories  contain  fan-ins  of  up 
to  17  inputs,  which  would  require  more  than  130,000 

*Our  version  includes  smoothing  with  Laplace  estimates 
which  significantly  improves  performance  (Kohavi,  Becker, 
&  Sommerfield,  1997) 


parameters  for  general  nodes,  demonstrating  the  im¬ 
portance  of  using  noisy-or/ands.  Here  we  present  the 
splice-junction  results  and  results  on  a  corrupted  ver¬ 
sion  of  the  promoter  theory.  Banner  also  performs 
well  on  revising  the  original  promoter  theory,  but  since 
its  structure  is  already  adequate,  this  problem  does 
not  test  structure  revision.  The  system  also  performed 
well on revising a knowledge base on C++ programming
to model students for an intelligent tutoring system
(Baffes & Mooney, 1996). Ramachandran (1998)
presents  complete  results. 
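The parameter count quoted above can be checked with a few lines of arithmetic. The sketch assumes boolean nodes, one free probability per joint parent configuration for a general node, and one weight per link plus an optional leak for a noisy-or/and node; the function names are illustrative.

```python
def full_cpt_parameters(fan_in: int) -> int:
    """A general boolean node needs one free probability for every joint
    configuration of its boolean parents."""
    return 2 ** fan_in


def noisy_parameters(fan_in: int, with_leak: bool = True) -> int:
    """A noisy-or/and node needs one weight per link, plus one for a leak."""
    return fan_in + (1 if with_leak else 0)


general = full_cpt_parameters(17)   # 131072
compact = noisy_parameters(17)      # 18
```

With a fan-in of 17, the general node needs 131,072 parameters, while the noisy-or/and node needs at most 18.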

In  order  to  compare  to  previous  results,  we  generated 
learning  curves  in  which  the  data  was  randomly  split 
into  independent  training  and  test  sets,  systems  were 
trained  on  the  training  data,  and  then  tested  on  classi¬ 
fying  the  test  examples.  Results  were  averaged  over  20 
random  training/test  splits.  This  was  done  for  training 
sets with increasing numbers of examples. A two-tailed
paired  t-test  is  used  to  evaluate  the  statistical  signif¬ 
icance  of  differences  in  performance  given  a  specific 
number  of  training  examples. 
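For readers who want to reproduce this kind of comparison, the following is a minimal sketch of the paired t statistic for two systems evaluated on the same random splits. It only computes the statistic; the accuracy values are made up for illustration, and obtaining a two-tailed p-value additionally requires the t distribution with n - 1 degrees of freedom (19 for 20 splits).

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(acc_a: Sequence[float], acc_b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference divided by its
    standard error, with accuracies paired by training/test split."""
    if len(acc_a) != len(acc_b) or len(acc_a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Hypothetical accuracies of two systems on the same four splits.
system_a = [0.90, 0.80, 0.85, 0.95]
system_b = [0.85, 0.78, 0.80, 0.90]
t_value = paired_t_statistic(system_a, system_b)
```

Pairing by split removes the variance due to easy or hard splits, which is why it is preferred here over an unpaired test.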

3.1  DNA  Splice-Junction

This  problem  addresses  the  task  of  detecting  splice- 
junctions,  the  boundaries  between  the  utilized  and  un¬ 
utilized  sequences  in  DNA.  The  data  set  consists  of 
3190  examples  consisting  of  strings  of  60  nucleotides 
with the values A, C, G, or T, and assigned to three
different categories. The initial theory consists of 47
propositional rules.

Figure 4 shows the primary results and Figure 5 shows
the ablation results. The experiment provides evidence
that Banner is successful at improving the accuracy
of the initial theory significantly with just a
small number of examples. The accuracy of the initial
theory rises from 55%, before revision, to 73.6%
when trained on just 20 examples, and to about 91.2%
when trained on 400 examples. The performance of
the three refinement algorithms, Rapture, Banner,
and Kbann, is similar, although Rapture performs
slightly better. The differences between Rapture and
Banner are small but statistically significant for all
points on the learning curve at the 0.01 level. The inductive
algorithms all perform significantly worse for
smaller training sets, although Naive Bayes catches
up with Rapture at 200 examples. The differences
between Banner and Naive Bayes are significant
at the 0.01 level or better for 20, 50, and 100 examples,
where the former performs considerably better,
and at the 0.001 level for 400 examples, where it performs
slightly worse.

Figure 4: Splice-Junction: Performance of Various
Systems (horizontal axis: number of training examples)

Figure  5  demonstrates  that  structure  revision  con¬ 
tributes  significantly  to  Banner’s  performance  on 
smaller training sets. Structure revision contributes
an improvement in accuracy of about 13%
over Banner-Pr for 20 examples (significant at the 0.001
level), and an improvement of about 2.8% for 50 examples
(significant at the 0.05 level). The revisions
that  contributed  the  most  to  this  improvement  were 
deletions  of  the  links  between  nodes  IE  and  PR,  and 
nodes EI and P5G. The differences between Banner
and  Banner-Pr  are  not  statistically  significant  at  the 
rest  of  the  points  on  the  learning  curve.  As  expected, 
starting  out  with  an  initial  theory  gives  Banner  a 
significant edge over Banner-Ind. The difference in
performance between these systems is statistically significant
for all points on the learning curves, except at
100 examples, at the 0.02 level or better.

3.2  Evaluation  of  Structure  Revision  on 
DNA  Promoter 

In  order  to  more  directly  study  structure  revision,  an 
existing  theory  with  adequate  structure  was  corrupted 
and  Banner’s  ability  to  recover  the  lost  structure  was 
examined.  The  DNA  promoter  recognition  problem 
involves  identifying  DNA  sequences  that  indicate  the 
start  of  a  new  gene.  Figure  6  shows  a  portion  of  the 
Bayesian  network  derived  from  the  initial  theory  for 


Figure  5:  Splice-Junction:  Banner  Ablations 


this  problem.  The  data  set  contains  468  examples,  con¬ 
sisting  of  strings  of  57  nucleotides  classified  as  pro¬ 
moters  or  non-promoters.  Although  in  refinement  ex¬ 
periments  theories  are  sometimes  corrupted  randomly 
(Pazzani  &  Brunk,  1993),  we  found  that  the  redun¬ 
dancy  in  this  theory  makes  it  very  robust  to  small 
corruptions.  Therefore,  we  generated  a  corrupt  the¬ 
ory  by  deleting  a  portion  of  the  theory  we  knew  to 
be critical, namely the intermediate concept minus_35
(deleted  portion  shown  in  bold  in  Figure  6). 

Figure 6: DNA Promoter Recognition - Initial
Bayesian Network

Figure 7: Effect of structure revision on corrupted promoter
theory

Figure 7 shows Banner-Pr and Banner’s performance
with this damaged theory compared to Banner’s
performance with the original theory. The graph
shows that removing minus_35 degrades the theory to
the  extent  that,  for  most  points  in  the  learning  curve, 
parameter  revision  alone  cannot  recover  the  accuracy 
attained with the original theory. The results show
that, for larger training sets, structure revision is effective
at recovering a fair portion of the accuracy lost
due to the corruption, although the difference between
Banner-Pr and Banner is only significant (at the
0.05 level) at 400 examples.

The fact that Banner and Banner-Pr result in comparable
accuracies for smaller training sets can be explained
by the fact that none of the trials with 10 and
20 training examples, and less than half the trials with
50 examples, required structure revision. Notice that
the corrupted theory results in better networks than
the original when trained on 10 examples. With 20
and 50 examples, the corrupted theory is still usually
able to fit the training examples without structure revision,
but results in poorer generalization. This leads
to the hypothesis that, for smaller training sets, there
are several theories that are as good as the original theory
in fitting the training set, but are worse in terms
of generalization, which would partially explain the
observation that structure revision leads to improved
training accuracies without any improvements in generalization
when trained on 50 and 100 examples.


Figure  8  illustrates  a  revised  network.  The  nodes  and 
links added by Banner are indicated by shaded ellipses
and bolder arrows, and the numbers beside the
links  represent  parameter  values.  Note  that  some 
nodes  have  been  replicated  in  the  figure  for  clarity 
only.  Banner  added  several  features  to  the  network: 
P-35=T,  P-36=T,  P-34=G,  P-33=A  and  P-3^A  and 
added  new  links  from  features  already  present  in  the 
network:  P-11=A,  and  P-10=A.  In  addition,  it  has 
added three hidden variables, I-1 through I-3. A comparison
with the original theory indicates that the
added unit I-1 roughly corresponds to the deleted
minus_35 concept. However, in the original theory,
minus_35 combines conjunctively with minus_10,
whereas here it combines disjunctively. That could
explain  why  Banner  also  added  some  of  these  fea¬ 
tures  to  the  sub-network  above  minus  AO  A.  However, 
realize  that  the  initial  theory  is  not  known  to  have 
the  correct  structure,  it  is  simply  one  proposed  in  the 
biological  literature  that  is  also  consistent  with  the 
available  data.  Also,  note  that  the  modifications  to 
the network are not confined to any particular level
(as they are in Mahoney and Mooney (1994)).

In summary, our experiments demonstrate that Banner
is effective in revising Bayesian networks with
hidden variables to significantly improve their accuracy.
They also demonstrate that the structure revision
algorithm contributes significantly to the overall
algorithm and makes semantically interpretable revisions.
The effectiveness of the structure revision algorithm
is also illustrated by the fact that Banner-Ind
learns highly accurate classifiers. Experiments
have also been performed that show that Banner-Ind
learns more accurate classifiers than Naive Bayes
on the problem of classifying chess end-games (Quinlan,
1983). Ramachandran (1998) provides details on
these results.

4  Related  Work 

While  recent  techniques  have  begun  to  address  the 
problem of learning the structure of a Bayesian network
from incomplete data (Ramoni & Sebastiani,
1997;  Friedman,  1997),  only  a  few  address  the  prob¬ 
lem  of  learning  or  revising  networks  with  hidden  vari¬ 
ables.  MS-EM  (Friedman,  1997)  extends  EM  to  learn 
the  structure  as  well  as  the  parameters  of  a  network 
from incomplete data. While it works when the initial
theory contains hidden variables, it cannot construct
new hidden variables. Kwoh and Gillies (1996)
present  a  procedure  for  adding  hidden  variables  by 
first  learning  a  Bayesian  network  from  data  without 
hidden  variables,  and  then  using  statistical  analysis  to 
find  correlations  between  variables  with  the  same  cause 
and  clustering  such  variables  with  a  new  hidden  node. 
These  techniques  have  been  demonstrated  on  learning 
small  networks,  but  have  not  been  evaluated  on  larger, 
real-world problems. Moreover, they have no mechanism
for  selecting  a  candidate  set  of  nodes  that  need  to  be 
revised,  instead  relying  on  blind  search  through  the 
space  of  all  possible  revisions. 

5  Future  Research 

Experiments  on  other  realistic  problems,  particularly 
ones  in  which  the  initial  theory  is  specified  as  a 
Bayesian  network  (rather  than  translated  from  rules), 
is  one  area  for  future  research.  The  current  results  for 
Banner involve problems of causal inference; tests on
tasks  involving  abductive  inference  are  also  needed. 
More detailed comparisons of different Bayes-net induction
and revision algorithms and competing methods
on realistic problems measuring both training time
and  predictive  accuracy  are  clearly  needed.  The  cur¬ 
rent  literature  on  Bayes-net  learning  is  particularly 
lacking in this regard relative to other areas of machine
learning (Friedman, Goldszmidt, Heckerman, &
Russell,  1997). 

Extending  Banner’s  general  approach  to  handle 
nodes  other  than  noisy-or/and  ones  is  an  important 
area  for  future  study.  Another  is  theory  refinement 
for  unsupervised  learning  where  there  is  not  a  specific 
targeted  inference  task.  The  algorithm  can  also  be  ex¬ 
tended  to  use  Bayesian  metrics  to  select  new  nodes  to 
be  added  to  the  parent  set  of  a  node.  A  number  of 
interesting  ideas  for  learning  and  revising  Bayes  nets 
have been proposed, but integrating them into an efficient
and effective system with clearly demonstrated
advantages  over  other  machine-learning  methods  on 
realistic  problems  is  still  a  challenge. 

6  Conclusion 

We  have  introduced  a  novel  technique  for  revising 
Bayesian  networks  that  can  handle  existing  hidden 
variables  as  well  as  create  new  ones.  We  have  demon¬ 
strated,  through  experiments  on  realistic  problems, 
that  this  approach  can  efficiently  revise  large  networks 
and  produce  highly  accurate  classifiers.  The  results 
are  also  competitive  with  those  of  the  best  theory  re¬ 
finement  systems  while  maintaining  the  precise  proba¬ 
bilistic  semantics  of  Bayesian  networks  that  we  believe 
make  the  resulting  theories  significantly  more  com¬ 
prehensible.  Whereas  existing  techniques  for  revising 
BayesicUi  networks  must  search  through  the  space  of 
all possible revisions, we have presented novel mechanisms
for using the information in the data to guide
the  search  for  useful  revisions,  thus  focusing  the  search 
and  making  it  tractable  for  larger,  more  realistic  prob¬ 
lems. 
Abstract

We  present  and  solve  a  real-world  problem  of 
learning  to  drive  a  bicycle.  We  solve  the  prob¬ 
lem  by  online  reinforcement  learning  using  the 
Sarsa(λ)-algorithm. Then we solve the composite
problem of learning to balance a bicycle and
then  drive  to  a  goal.  In  our  approach  the  rein¬ 
forcement  function  is  independent  of  the  task  the 
agent  tries  to  learn  to  solve. 

1  Introduction 

Here  we  consider  the  problem  of  learning  to  balance  on  a 
bicycle.  Having  done  this  we  want  to  drive  the  bicycle  to 
a  goal.  The  second  problem  is  not  as  straightforward  as  it 
may  seem.  The  learning  agent  has  to  solve  two  problems 
at  the  same  time:  Balancing  on  the  bicycle  and  driving  to 
a  specific  place.  Recently,  ideas  from  behavioural  psychol¬ 
ogy  have  been  adapted  by  reinforcement  learning  to  solve 
this  type  of  problem.  We  will  return  to  this  in  section  3. 

In  reinforcement  learning  an  agent  interacts  with  an  envi¬ 
ronment  or  a  system.  At  each  time  step  the  agent  receives 
information  on  the  state  of  the  system  and  chooses  an  ac¬ 
tion  to  perform.  Once  in  a  while,  the  agent  receives  a  re¬ 
inforcement  signal  r.  Receiving  a  signal  could  be  a  rare 
event or it could happen at every time step. No evaluative
feedback from the system other than the failure signal is
available. The goal of the agent is to learn a mapping from
states to actions that maximizes the agent’s discounted
reward over time [Bertsekas and Tsitsiklis, 1996,
Sutton and Barto, 1998]. The discounted reward is the sum
$\sum_{t=0}^{\infty} \gamma^t r_t$, where $\gamma$ is the discount parameter.
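
As a small illustration of the discounted sum, the following Python sketch evaluates it for a finite reward sequence. The function name `discounted_return` and the example rewards are illustrative assumptions and are not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


# Hypothetical episode: no signal for three steps, then the failure signal -1.
example = discounted_return([0.0, 0.0, 0.0, -1.0], gamma=0.99)
```

With a discount parameter close to 1, a failure that lies a few steps ahead still weighs almost fully on the current state.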

Many techniques have been developed to find near-optimal
mappings on a trial-and-error basis. In this paper we
use the Sarsa(λ)-algorithm, developed by Rummery and


1. Initialize all eligibility traces $e_0 = 0$.

2. Set $t = 0$.

3. Choose action $a_t$.

4. If $t > 0$ then learn
   $w_t = w_{t-1} + \alpha \left[ r_{t-1} + \gamma Q_t - Q_{t-1} \right] e_{t-1}$.

5. Calculate $\nabla_w Q_t$ with respect to the chosen action.

6. Update accumulating traces as
   $e_t = \gamma \lambda e_{t-1} + \nabla_w Q_t$.
   Update replacing traces as
   $e_t = \begin{cases} \nabla_w Q_t & \text{if } \nabla_w Q_t \neq 0, \\ \gamma \lambda e_{t-1} & \text{otherwise.} \end{cases}$

7. Perform action, receive reinforcement signal.

8. If the system has not entered a terminal state, then
   $t \leftarrow t + 1$ and jump to point 3.

9. Otherwise perform the learning (point 4) with $Q_t = 0$.


Figure 1: The Sarsa(λ)-algorithm.

Niranjan [Rummery and Niranjan, 1994, Rummery, 1995,
Singh and Sutton, 1996, Sutton and Barto, 1998], because
empirical studies suggest that this algorithm is the best
so far [Rummery and Niranjan, 1994, Rummery, 1995,
Sutton and Barto, 1998]. Figure 1 shows the
Sarsa(λ)-algorithm. We have modified the algorithm
slightly by cutting off eligibility traces that fall below 10“^
in order to save calculation time. For replacing traces we
allowed the trace for each state-action pair to continue
until that pair occurred again, contrary to Singh and Sutton
[Singh and Sutton, 1996].
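
The update loop of Figure 1 can be sketched in Python for the special case of a table with one weight per state-action pair, where the gradient of Q is 1 for the visited pair and 0 elsewhere. This is a minimal sketch, not the authors' code: the class name `SarsaLambda`, its method names, and the trace cut-off `1e-6` are assumptions (the threshold in the text is illegible), and action selection is left to the caller.

```python
import numpy as np


class SarsaLambda:
    """Tabular Sarsa(lambda): one weight and one trace per (state, action)."""

    def __init__(self, n_states, n_actions, alpha=0.5, gamma=0.99, lam=0.95,
                 replacing=True, cutoff=1e-6):
        self.w = np.zeros((n_states, n_actions))
        self.e = np.zeros_like(self.w)
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.replacing = replacing
        self.cutoff = cutoff  # assumed value, not taken from the paper
        self.q_prev = None

    def start_episode(self):
        self.e[:] = 0.0     # point 1: reset all traces
        self.q_prev = None  # point 2: t = 0

    def _learn(self, r_prev, q_now):
        # point 4: w <- w + alpha * [r + gamma * Q_t - Q_{t-1}] * e
        delta = r_prev + self.gamma * q_now - self.q_prev
        self.w += self.alpha * delta * self.e

    def step(self, state, action, r_prev=0.0):
        """Call once the action for `state` has been chosen (point 3)."""
        q_now = self.w[state, action]
        if self.q_prev is not None:
            self._learn(r_prev, q_now)
        # point 6: decay every trace, then mark the visited pair
        self.e *= self.gamma * self.lam
        if self.replacing:
            self.e[state, action] = 1.0
        else:
            self.e[state, action] += 1.0
        self.e[np.abs(self.e) < self.cutoff] = 0.0  # drop negligible traces
        self.q_prev = q_now

    def end_episode(self, r_final):
        # point 9: final update with Q_t = 0 in the terminal state
        self._learn(r_final, 0.0)
```

Only the trace of the visited pair is overwritten in the replacing variant, so traces of other actions in the same state keep decaying until those pairs occur again, which matches the variation described above.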

2  Learning  to  balance  on  a  bicycle 

Our first task is to learn to balance. At each time step the
agent receives information about the state of the bicycle:
the angle and angular velocity of the handle bars, and the
angle from vertical to the bicycle together with its angular
velocity and angular acceleration. For details of the bicycle
system we refer to appendix A.

The agent chooses two basic actions: what torque should
be applied to the handle bars, $T \in \{-2\,\mathrm{N}, 0\,\mathrm{N}, +2\,\mathrm{N}\}$,
and how much the centre of mass should be displaced
from the bicycle’s plane, $d \in \{-2\,\mathrm{cm}, 0\,\mathrm{cm}, +2\,\mathrm{cm}\}$,
giving a total of 9 possible actions. Noise is laid on the
choice of displacement to simulate an imperfect balance:
$d = d_{\text{agent's choice}} + s p$, where $p$ is a random number
within $[-1, 1]$ and $s$ is the noise level measured in
centimeters. We use $s = 2$ cm.
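
The action set and the displacement noise can be sketched as follows. The names `ACTIONS` and `noisy_displacement` are hypothetical, the ordering of the nine actions is an assumption, and a uniform distribution for $p$ is assumed because the text only says "a random number within [-1; 1]".

```python
import random

TORQUES = (-2.0, 0.0, 2.0)           # torque values as given in the text
DISPLACEMENTS_CM = (-2.0, 0.0, 2.0)  # displacement choices in cm

# Assumed ordering: torque-major, displacement-minor.
ACTIONS = [(torque, d) for torque in TORQUES for d in DISPLACEMENTS_CM]


def noisy_displacement(d_choice_cm, s_cm=2.0, rng=random):
    """Return the displacement actually applied: d = d_choice + s * p."""
    p = rng.uniform(-1.0, 1.0)  # assumed uniform
    return d_choice_cm + s_cm * p
```

With $s = 2$ cm the noise is as large as the displacement itself, so a chosen displacement of +2 cm can end up anywhere between 0 cm and +4 cm.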

Our  agent  consists  of  3456  input  neurons  and  9  output  neu¬ 
rons,  with  full  connectivity  and  no  hidden  layers.  The 
learning  rate  is  a  =  0.5.  The  continuous  state  data  is 
discretised  by  non-overlapping  intervals  in  the  state-space, 
such  that  there  is  exactly  one  active  neuron  in  the  input 
layer. This neuron represents state information for all the
different  state  variables.  The  discrete  intervals  (boxes)  are 
based  on  the  following  quantization  thresholds: 

The angle the handle bars are displaced from normal, $\theta$: 0,
±0.2, ±1, ±1 radians.

The angular velocity of this angle, $\dot{\theta}$: 0, ±2, ±∞
radians/second.

The angle from vertical to bicycle, $\omega$: 0, ±0.06, ±0.15,
±π/15 radians.

The angular velocity, $\dot{\omega}$: 0, ±0.25, ±0.5, ±∞
radians/second.

The angular acceleration, $\ddot{\omega}$: 0, ±2, ±∞ radians/second².
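
The box coding can be sketched in Python as below. Only the interior thresholds are needed to pick a box, since the outermost threshold of each variable merely bounds the two outer boxes. The dictionary layout and the names `THRESHOLDS` and `box_index` are assumptions made for this sketch.

```python
import numpy as np

# Interior thresholds per state variable (outermost bounds omitted).
THRESHOLDS = {
    "theta":      [-1.0, -0.2, 0.0, 0.2, 1.0],
    "theta_dot":  [-2.0, 0.0, 2.0],
    "omega":      [-0.15, -0.06, 0.0, 0.06, 0.15],
    "omega_dot":  [-0.5, -0.25, 0.0, 0.25, 0.5],
    "omega_ddot": [-2.0, 0.0, 2.0],
}
ORDER = list(THRESHOLDS)
SHAPE = tuple(len(THRESHOLDS[name]) + 1 for name in ORDER)  # boxes per variable
N_BOXES = int(np.prod(SHAPE))


def box_index(state):
    """Map a dict of continuous state values to the single active input neuron."""
    idx = [int(np.digitize(state[name], THRESHOLDS[name])) for name in ORDER]
    return int(np.ravel_multi_index(idx, SHAPE))
```

The interval counts are 6, 4, 6, 6 and 4, and their product is 3456, which agrees with the number of input neurons stated above.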



Figure  2:  Number  of  seconds  the  agent  can  balance  on  the 
bicycle,  as  a  function  of  the  number  of  trials.  Average  of  40 
agents.  (After  the  agent  has  learned  the  task,  1000  seconds 
are  used  in  calculation  of  the  average.) 


Figure 2 shows the number of seconds the agent can balance
on the bicycle as a function of the number of trials.
When the agent can balance for 1000 seconds, the task is
considered learned. Here $\lambda = 0.95$ and $\gamma = 0.99$. Several
CMAC-systems (also known as generalized grid coding)
[Watkins, 1989, Santamaría et al., 1996, Sutton, 1996,
Sutton and Barto, 1998] were also tried, but none of them
gave the agent a learning time below 5000 trials.

Figures 3 and 4 show the movements of the bicycle at the
beginning of a learning process seen from above. Each time the
bicycle falls over it is restarted at the starting point. At each
time step a line is drawn between the points where the tyres
touch the ground.

Both accumulating and replacing eligibility traces were tried.
The results are shown in figure 5. They support the general
conclusions drawn by Singh and Sutton [Singh and Sutton, 1996]:
replacing traces make the agent perform much better than
conventional, accumulating traces. Long traces help the agent best.

3  Shaping 

The idea of shaping, which is borrowed from behavioural
psychology, is to give the learning agent a series of
relatively easy problems building up to the harder problem of
ultimate interest [Sutton and Barto, 1998]. The term
originates from the psychologist Skinner [Skinner, 1938], who
studied the effect on animals, especially pigeons and rats.

To train an animal to produce a certain behavior, the
trainer must find out what subtasks constitute an
approximation of the desired behavior, and how these should
be reinforced [Staddon, 1983]. By rewarding successive
approximations to the desired behavior, pigeons can be
brought to peck a selected spot [Skinner, 1953, p. 93],
horses to do clever tricks in a circus, such as seemingly
recognizing flags of nations or numbers and doing calculations
[Jørgensen, 1962, pp. 137-139], and pigs to perform complex
acts such as eating breakfast at a table and vacuuming
the floor [Atkinson et al., 1996, p. 242]. Staddon notes
that human education as well is built up as a process of
shaping if behavior is taken to include “understanding”
[Staddon, 1983, p. 458].


Figure 3: The first 151 trials seen from above. The longest path
is 7 meters.
Figure 4: The same route as figure 3 a little later. Now the
agent can balance the bicycle for 30-40 meters. The agent
starts each trial in an equilibrium position
$(\theta, \dot{\theta}, \omega, \dot{\omega}, \ddot{\omega}) = (0, 0, 0, 0, 0)$.
During the first trials it learns to avoid disturbing this
unnecessarily, i.e. it learns to keep driving straight forward.
Now the most difficult part of the learning remains: to learn
to come safely through a dangerous situation. A weak (random)
preference for turning right (instead of left) is strengthened
during the learning as the agent gets better at handling
problematic situations and therefore receives less discounted
punishment than expected.


Shaping can be used to speed up the learning process for
a problem or in general to help the reinforcement learning
technique scale to larger and more complex problems. But
there is a price to be paid for faster learning: we must give
up the tabula rasa attitude that is one of the attractive
aspects of basic reinforcement learning. To use shaping in
practice one must know more about the problem than just
under which conditions an absolutely good or bad state has
been reached. This introduces the risk that the agent learns
a solution to a problem that is only locally optimal.

There  are  at  least  three  ways  to  implement  shaping  in  rein¬ 
forcement  learning:  By  lumping  basic  actions  together  as 
macro-actions,  by  designing  a  reinforcement  function  that 
rewards  the  agent  for  making  approximations  to  the  desired 
behavior,  and  by  structurally  developing  a  multi-level  ar¬ 
chitecture  that  is  trained  part  by  part. 

Selfridge,  Sutton  and  Barto  showed  that  transferring 
knowledge  from  solving  an  easy  version  of  a  problem  such 
as  the  classical  pole  mounted  on  a  cart  can  ease  learning  a 
more  difficult  version  [Selfridge  et  al.,  1985]. 

McGovern,  Sutton  and  Fagg  have  tested  macro-actions  in  a 
gridworld  and  found  that  in  some  cases  they  accelerate  the 
learning  process  [McGovern  et  al.,  1997]. 

Dorigo,  Colombetti  and  Borghi  have  worked  with 
shaping  for  real  robots  [Dorigo  and  Colombetti,  1993, 
Colombetti  et  al.,  1996,  Dorigo  and  Colombetti,  1997]. 
They use reinforcement learning as a means to translate
suggestions from an external trainer. The trainer is a
programme in itself with a high-level representation of
the desired behavior that provides immediate reinforcement.
For instance in “The Hamster Experiment”
[Colombetti  et  al.,  1996]  the  robot’s  task  is  to  collect 
pieces  of  food  (colored  cans)  and  bring  them  to  its  nest. 
The  trainer  provides  the  agent  with  a  reinforcement  signal 
for  approaching  the  food.  This  signal  is  proportional  to  the 
decrease  in  the  distance  between  the  robot  and  the  pieces 
of  food.  The  training  of  the  agent  boils  down  to  translating 
the  high-level  trainer  to  a  low-level  control  programme. 
This  method  of  shaping  by  a  trainer  has  a  number  of 
advantages  as  well  as  disadvantages.  The  agent  does  not 
have  to  solve  the  delayed  reinforcement  problem.  But  on 
the  other  hand,  the  programmer  of  the  trainer  must  know 
in  advance  what  high-level  behavior  is  desired,  and  to  such 
a  degree  that  the  trainer  can  judge  how  well  a  single  move 
fits  into  the  desired  behavior. 

Mataric has studied the possibility of putting implicit
domain knowledge into the agent by constructing a more
complex reinforcement function than commonly used
[Mataric, 1994]. Again the theory was tested on a real robot
moving cans to a nest. Here the constructed function did
not eliminate the need for solving the delayed reinforcement
problem.

Figure 5: Learning time for different values of λ for accumulating eligibility traces (left) and replacing traces (right). Each
point is an average of 30 simulations.

Gullapalli  has  studied  two  implementations  of  shaping 
[Gullapalli,  1992].  In  the  first  the  complexity  of  the  con¬ 
trol  task  is  gradually  increased  during  learning,  and  the  re¬ 
inforcement  function  used  is  changed  accordingly.  In  this 
way  most  of  a  training  run  is  used  in  learning  the  approx¬ 
imation  to  the  current  target  behavior.  This  system  was 
used  to  make  a  simulated  robot  hand  perform  a  series  of 
key  strokes  on  a  calculator.  The  actual  task  consisted  of  six 
subtasks. Secondly, Gullapalli considered structural shaping:
an incremental development of the learning system
where a multi-level architecture is trained in parts.

Gerald  Tesauro’s  Backgammon  playing  agent  achieved 
master  level  play  through  self-play  [Tesauro,  1992, 
Tesauro,  1994,  Tesauro,  1995].  This  can  be  considered  as 
a very successful example of the use of shaping. Self-play
is  a  sort  of  shaping,  since  at  first  the  agent  plays  against  a 
nearly  random  opponent  and  thereby  solves  an  easy  task. 
The  complexity  of  the  task  then  grows  as  the  agent  gets 
better  at  playing. 

In  Gullapalli’s  experiments  [Gullapalli,  1992]  and 
Selfridge, Sutton and Barto’s [Selfridge et al., 1985],
as  well  as  in  Dorigo,  Colombetti  and  Borghi’s 
[Colombetti  et  al.,  1996,  Dorigo  and  Colombetti,  1997], 
the  agent  received  a  different  reinforcement  signal  over 
time  for  the  same  behavior.  This  is  not  in  agreement  with 
the  original  inspiration  of  the  reinforcement  signal  as  being 
a hardwired signal inside the brain of an animal. To solve
this  problem,  we  need  the  reinforcement  function  to  be 
independent  of  what  task  the  agent  tries  to  learn  to  solve. 
Our  approach  in  general  is  to  let  the  most  basic  tasks  result 
in  the  lowest  reinforcement  signals  and  more  advanced 
tasks  correspond  to  larger  signals. 


Say we want a robot to learn to move forward like a child
(see figure 6). As a child grows stronger it discovers more
complex and faster ways of moving. Performing each way of
moving can be seen as a task that is more difficult than the
former. The robot starts by learning to roll. Having done so, it
might discover how to crawl. The reinforcement signal for
crawling is greater than that for rolling, and greater than what
the agent expects to receive, and therefore it acts as a reward.
Later, after having learned to walk, failing to walk and
falling back on crawling makes the robot receive a smaller
reinforcement signal than it expected, and the internal
reinforcement signal becomes negative; that is, the signal
acts as a punishment.

Figure 6: Reinforcement signals for the movements of a
child-robot. The signal levels shown, from highest to lowest, are:
running, walking, crawling, rolling, not moving.
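
The way a fixed signal ladder turns into reward or punishment can be sketched as follows. The numeric signal levels are invented for illustration (only their ordering comes from figure 6), and the "internal reinforcement" is modelled, as an assumption, simply as the received signal minus the signal the agent has come to expect.

```python
# Hypothetical signal levels; only the ordering is taken from figure 6.
SIGNAL = {
    "not moving": 0.0,
    "rolling": 1.0,
    "crawling": 2.0,
    "walking": 3.0,
    "running": 4.0,
}


def internal_reinforcement(received, expected):
    """Positive values act as reward, negative values as punishment."""
    return SIGNAL[received] - SIGNAL[expected]


first_crawl = internal_reinforcement("crawling", expected="rolling")
fall_back = internal_reinforcement("crawling", expected="walking")
```

The same external signal for crawling is a reward for an agent that only knows how to roll and a punishment for an agent that already walks, although the reinforcement function itself never changes.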


Can  these  basic  ideas  of  shaping  be  applied  to  reinforce¬ 
ment  learning,  and  make  it  possible  to  solve  a  complex 
problem  with  more  than  one  goal?  We  will  now  turn  to 
a  practical  study  of  these  theoretical  issues. 


4  Learning  to  drive  to  a  goal  using  shaping 

We want to study shaping on the composite problem of
learning to balance a bicycle and then drive to a goal. In
contrast to other experiments with shaping, we want the
agent to be totally in charge of when to switch task. When
one drives a bicycle in the morning to the institute and hits
a hole in the road, one instantaneously forgets about where
to go and focuses attention on the balance. We want the agent
to be able to switch task equally swiftly when it finds the
situation appropriate.

Figure 7: The weights from the ω-oriented input neurons (left) and the weights from the angle oriented input neurons
(right). Note the difference of the scale.

The  bicycle  starts  out  at  the  origin  heading  west.  The  goal 
is  a  circular  spot  (10  meter  radius)  positioned  1000  meters 
to  the  north  of  the  starting  point. 

We  enlarge  our  basic  network  by  20  more  input  neurons, 
with  full  connectivity  to  the  9  output  neurons.  The  angle 
between  the  driving  direction  and  the  direction  to  goal  is 
discretised  by  18°  intervals,  one  for  each  neuron.  Now 
there are exactly two active neurons in the agent’s input
layer: one for the state of the bicycle and one for the
driving direction relative to the goal. The learning rate for
the  weights  from  the  angle-input  neurons  is  chosen  to  be 
0.01 — much  smaller  than  the  rate  for  the  other  weights, 
in  order  to  reflect  the  different  time  scales  in  the  learning 
tasks:  We  do  not  want  the  weights  in  the  angle  oriented  part 
to  grow  large  while  the  agent  learns  to  balance  the  bicycle. 
The  odds  are  against  these  weights  ending  up  containing 
anything  useful. 
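
A minimal sketch of the enlarged network follows. With exactly two active inputs, each action value is the sum of one balance weight and one angle weight, and the two parts are updated with their own learning rates. Eligibility traces are left out to keep the sketch short, and the function names, the convention that bin 0 starts at 0°, and the use of degrees are assumptions.

```python
import numpy as np

N_BOXES, N_ANGLE_BINS, N_ACTIONS = 3456, 20, 9
ALPHA_BALANCE, ALPHA_ANGLE = 0.5, 0.01

w_balance = np.zeros((N_BOXES, N_ACTIONS))     # weights from the bicycle-state box
w_angle = np.zeros((N_ANGLE_BINS, N_ACTIONS))  # weights from the goal-angle bin


def angle_bin(psi_deg):
    """Index of the 18-degree interval containing the angle to the goal."""
    return int((psi_deg % 360.0) // 18.0)


def q_values(box, abin):
    """Action values for the two active input neurons."""
    return w_balance[box] + w_angle[abin]


def apply_td_error(box, abin, action, delta):
    """One-step update without traces, using a separate rate for each part."""
    w_balance[box, action] += ALPHA_BALANCE * delta
    w_angle[abin, action] += ALPHA_ANGLE * delta
```

Because the angle weights move fifty times more slowly, the large errors produced while the agent is still falling over are absorbed almost entirely by the balance weights.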

The reinforcement function is independent of the task the
agent tries to learn to solve. If the bicycle falls over, the
agent always receives -1; if the agent reaches the goal
it is rewarded by $r = 0.01$; and otherwise the agent
receives $r = (4 - \psi_g^2) \cdot 0.00004$, where $\psi_g$ is the angle
between the driving direction and the direction to goal
measured in radians. The agent is punished when driving away
from the goal and rewarded when driving towards it. This
reinforcement function is inspired by the signal used by
Colombetti, Dorigo and Borghi [Colombetti et al., 1996]
mentioned earlier. Note that the agent still has to solve
the delayed reinforcement problem. As one can see, the
numerical value of this signal is quite small. We tried
larger values, which made the agent learn to drive in the
correct orientation without being able to balance. After a
few hundred trials the agent at the starting point
immediately threw the bicycle to the right. The positive
reinforcement it received due to the correct orientation in
several time steps was large enough to make up for the
punishment from falling.
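
The reinforcement function can be written out as a short Python sketch. The function name and argument names are hypothetical. The squared angle in the shaping term is an assumption: the printed formula is partly illegible, and the square is the reading under which driving away from the goal gives a negative signal, as the text states.

```python
import math


def reinforcement(fallen, at_goal, psi_g):
    """Task-independent signal; psi_g is the angle to the goal in radians."""
    if fallen:
        return -1.0
    if at_goal:
        return 0.01
    return (4.0 - psi_g ** 2) * 0.00004  # assumed squared angle


towards = reinforcement(False, False, 0.0)    # heading straight at the goal
away = reinforcement(False, False, math.pi)   # heading directly away
```

Under this reading the shaping signal changes sign at an angle of 2 radians, and its largest value per time step is 0.00016, several thousand times smaller in magnitude than the punishment for falling.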



Figure  8:  Number  of  times  an  agent  drives  the  bicycle  to 
the  goal  for  twelve  agents. 

Figure  8  shows  the  number  of  times  twelve  agents  reach 
the  goal.  In  a  typical  learning  process  it  takes  the  agent 
1700  trials  to  learn  to  balance  (i.e.  drive  more  than  1000  s 
without  falling),  and  after  about  4200  trials  it  gets  to  the 
goal  for  the  first  time.  After  a  total  of  approximately  5700 
trials  it  drives  to  the  goal  more  or  less  every  time. 

Figure 7 shows the values of some of the important weights
after learning. The ω-weights shown are an average of
weight values around $\theta = 0$, $\dot{\theta} = 0$, $\dot{\omega} = 0$ and $\ddot{\omega} = 0$.
If the agent drives along in balance, the weights with
values in the relatively flat upper area are active for the
balance oriented input neurons, and the values of the angle
oriented neurons matter for the choice of action. The
weights belonging to the balance oriented input neurons
make the agent prefer actions 3, 4 and 5 (which correspond
to $T = 0$), but the weights belonging to the angle oriented
neurons decide which one. But if the state of the bicycle
enters an area of unbalance, the balance oriented input
neurons have far greater differences in the values of the
weights, and as a result the angle oriented input neurons do
not make any difference for the choice of action. In other
words: the agent swiftly shifts attention from the task of
finding the goal to the task of balancing the bicycle if required.


Figure  9:  A  typical  route  when  the  agent  reaches  the  goal 
for  the  first  time. 


Figures 9 and 10 show routes from the starting point to the
goal  (the  grey  circle  on  the  y-axis).  The  first  drives  to  the 
goal  can  be  as  long  as  200  km,  but  the  agent  soon  learns  to 
drive  to  the  goal  driving  “only”  7  km.  A  driving  distance 
as  short  as  1680  m  has  been  observed. 


Figure  10:  Already  after  10  drives  to  the  goal  the  agent 
navigates  a  little  better. 


The  goal  is  not  reached  just  by  coincidence.  The  probabil¬ 
ity  for  hitting  the  goal  at  random  is  quite  small.  An  estimate 
for the time required to reach the goal by doing a correlated
random walk is 10*° time steps. (The bee line from
the  starting  point  to  the  goal  is  3.610'*  time  steps.)  In  other 
words:  If  the  agent  had  to  solve  the  problem  of  learning  to 
drive to the goal without access to the shaping reinforcement
signal, i.e. the tabula rasa approach, it would take
enormous  amounts  of  time  before  it  hits  the  goal  for  the 
first  time  and  experiences  the  reward  for  getting  there. 

We  agree  with  Mataric  [Mataric,  1994]  that  these  hetero¬ 
geneous  reinforcement  functions  have  to  be  designed  with 
great  care.  In  our  first  experiments  we  rewarded  the  agent 
for  driving  towards  the  goal  but  did  not  punish  it  for  driv¬ 
ing  away  from  it.  Consequently  the  agent  drove  in  circles 
with  a  radius  of  20-50  meters  around  the  starting  point. 
Such  behavior  was  actually  rewarded  by  the  reinforcement 
function; furthermore, circles with a certain radius are physically
very stable when driving a bicycle because of the
cross  terms  in  eqs.  (2)  and  (3)  in  the  appendix. 
5 Conclusion

Our  results  demonstrate  the  utility  of  reinforcement  learn¬ 
ing  on  a  difficult,  dynamical  real  world  problem.  It  is  pos¬ 
sible  to  learn  to  balance  a  bicycle  by  pure  reinforcement 
learning  with  only  one  (rare)  reinforcement  signal.  Further¬ 
more  it  is  possible  to  learn  a  solution  to  the  double  problem 
of  balancing  on  the  bicycle  and  driving  to  a  goal  by  com¬ 
bining  reinforcement  learning  with  shaping.  The  applica¬ 
tion  of  shaping  accelerated  the  learning  process  immensely. 
Without  shaping,  it  would  not  have  been  practical  to  wait 
for  the  agent  to  discover  the  goal  and  the  reward  for  getting 
there. 
A Details of the Bicycle Simulation

The bicycle must be held upright within ±12° measured
from vertical position. If the angle from the vertical to the
bicycle falls outside this interval, the bicycle has fallen, and
the agent receives punishment -1. The bicycle is modeled
by the following non-linear differential equations. One
simplification was made to ease the derivation of the
equations: the front fork was assumed to be vertical, which is
unusual but not impossible. This, however, made the task a
bit more difficult for the agent.

There are two important angles in this problem: the angle
$\theta$ of the direction of the bicycle from straightforward, and
the angle $\omega$ the bicycle is tilted from vertical. The
conservation of angular momentum of the tyres results in some
important cross terms.

The  equations  do  not  model  a  bicycle  exactly,  as  some  sec¬ 
ond  order  cross  effects  were  ignored  during  the  derivation. 
However, we believe that the largest problem of transferring
to a real bicycle would be to build hardware that could
withstand falling over a thousand times, not just without
crashing but also without changing and thereby making the
system non-stationary.


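The angular acceleration $\ddot{\omega}$ can be calculated as:

$$\ddot{\omega} = \frac{1}{I_{\text{bicycle and cyclist}}} \left( M h g \sin\varphi - \cos\varphi \left( I_{dc}\,\dot{\sigma}\,\dot{\theta} + \operatorname{sign}(\theta)\, v^2 \left( \frac{M_d r}{r_f} + \frac{M_d r}{r_b} + \frac{M h}{r_{CM}} \right) \right) \right) \qquad (2)$$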

This equation is the mechanical equation for angular
momentum. The physical contents of the right hand side are
terms for the gravitation, effects of the conservation of
angular momentum of the tyres and the fictitious centrifugal
force. The term $I_{dc}\,\dot{\sigma}\,\dot{\theta}$ is important for understanding why
it is relatively easier to ride a bicycle than to keep the balance
on a bicycle standing still. The cross effects that originate
from the conservation of angular momentum of the tyres
stabilize the bicycle, and this effect is proportional to the
angular velocity of the tyres $\dot{\sigma}$ and thereby to the velocity
of the bicycle.

The angular acceleration $\ddot{\theta}$ of the front tyre and the handle
bars is:

$$\ddot{\theta} = \frac{T - I_{dv}\,\dot{\sigma}\,\dot{\omega}}{I_{dl}} \qquad (3)$$

These  equations  are  not  an  exact  analytical  description,  as 
some  second  (and  higher)  order  terms  have  been  ignored. 
The values of $\omega$, $\dot{\omega}$, $\ddot{\omega}$, $\theta$, $\dot{\theta}$ are sent to the agent at each
time step. The agent returns the value of $d$ and the torque
$T$.


Figure  11:  The  bicycle  as  seen  from  behind.  The  thick 
line  represents  the  bicycle.  CM  is  the  centre  of  mass  of  the 
bicycle and cyclist.

The following equations describe the mechanics of the
system. (See figure 11.) The angle $\varphi$ is the total angle of tilt
of the centre of mass, and is defined as:

$$\varphi \stackrel{\text{def}}{=} \omega + \arctan\frac{d}{h} \qquad (1)$$

Figure 12: Seen from above. The thick line represents the
front tyre.


The front and back tyres follow different paths in a curve
with different radii (see figure 12). The front tyre follows
the longest path. The radius for the front tyre is:

$$r_f = \frac{l}{\left|\cos\left(\frac{\pi}{2} - \theta\right)\right|} = \frac{l}{|\sin\theta|} \qquad (4)$$

And for the back tyre:

$$r_b = l \left|\tan\left(\frac{\pi}{2} - \theta\right)\right| = \frac{l}{|\tan\theta|} \qquad (5)$$

For the CM the radius can be calculated as:

$$r_{CM} = \left( (l - c)^2 + \frac{l^2}{(\tan\theta)^2} \right)^{1/2} \qquad (6)$$

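The equations of the position of the tyres. For the front tyre:

$$\begin{pmatrix} x_f \\ y_f \end{pmatrix}_{t+1} = \begin{pmatrix} x_f \\ y_f \end{pmatrix}_{t} + v\,dt \begin{pmatrix} -\sin\left(\psi + \theta + \operatorname{sign}(\psi + \theta) \arcsin\frac{v\,dt}{2 r_f}\right) \\ \cos\left(\psi + \theta + \operatorname{sign}(\psi + \theta) \arcsin\frac{v\,dt}{2 r_f}\right) \end{pmatrix}$$

And for the back tyre:

$$\begin{pmatrix} x_b \\ y_b \end{pmatrix}_{t+1} = \begin{pmatrix} x_b \\ y_b \end{pmatrix}_{t} + v\,dt \begin{pmatrix} -\sin\left(\psi + \operatorname{sign}(\psi) \arcsin\frac{v\,dt}{2 r_b}\right) \\ \cos\left(\psi + \operatorname{sign}(\psi) \arcsin\frac{v\,dt}{2 r_b}\right) \end{pmatrix}$$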

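We estimated the values of the moments of inertia as:

$$I_{\text{bicycle and cyclist}} = \frac{13}{3} M_c h^2 + M_p (h + d_{CM})^2 \qquad (7)$$

The various moments of inertia for a tyre were estimated as
(see figure 13):

$$I_{dc} = M_d r^2 \qquad (8)$$

$$I_{dv} = \frac{3}{2} M_d r^2 \qquad (9)$$

$$I_{dl} = \frac{1}{2} M_d r^2 \qquad (10)$$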

Table  1  shows  the  values  of  the  parameters  used  for  the 
bicycle  system. 


Figure  13:  Axis  for  moments  of  inertia  for  a  tyre. 


Notation         Description                                        Value

$c$              Horizontal distance between the point where       66 cm
                 the front wheel touches the ground and the CM
CM               The centre of mass of the bicycle and cyclist
                 as a total
$d$              The agent’s choice of the displacement of the
                 CM perpendicular to the plane of the bicycle
$d_{CM}$         The vertical distance between the CM for the      30 cm
                 bicycle and for the cyclist
$h$              Height of the CM over the ground                  94 cm
$l$              Distance between the front tyre and the back      111 cm
                 tyre at the point where they touch the ground
$M_c$            Mass of the bicycle                               15 kg
$M_d$            Mass of a tyre                                    1.7 kg
$M_p$            Mass of the cyclist                               60 kg
$r$              Radius of a tyre                                  34 cm
$\dot{\sigma}$   The angular velocity of a tyre
$T$              The torque the agent applies on the handlebars
$v$              The velocity of the bicycle                       10 km/h

Table 1: Notation and values for the bicycle system.
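
To make the appendix concrete, the following Python sketch evaluates the two angular accelerations from the parameter values in Table 1. It is a sketch under stated assumptions, not the authors' simulator: it assumes the usual reading of equations (1) to (10), takes $M$ to be the combined mass $M_c + M_p$, takes the tyre angular velocity as $\dot{\sigma} = v / r$, and uses $g = 9.81$ m/s². None of these three choices is stated in the text. All function and constant names are hypothetical.

```python
import math

# Values from Table 1, converted to SI units.
M_C, M_D, M_P = 15.0, 1.7, 60.0
R, L, C, H, D_CM = 0.34, 1.11, 0.66, 0.94, 0.30
V = 10.0 / 3.6

# Assumptions not stated in the text.
M = M_C + M_P      # total mass
G = 9.81           # gravitational acceleration
SIGMA_DOT = V / R  # angular velocity of a tyre

# Moments of inertia, equations (7) to (10).
I_BC = 13.0 / 3.0 * M_C * H ** 2 + M_P * (H + D_CM) ** 2
I_DC = M_D * R ** 2
I_DV = 1.5 * M_D * R ** 2
I_DL = 0.5 * M_D * R ** 2


def accelerations(theta, theta_dot, omega, omega_dot, d, torque):
    """Return (omega_ddot, theta_ddot) for displacement d (m) and torque."""
    phi = omega + math.atan(d / H)  # total tilt of the centre of mass
    if theta == 0.0:
        # Driving straight: all turning radii are infinite.
        inv_rf = inv_rb = inv_rcm = 0.0
        sign_theta = 0.0
    else:
        inv_rf = abs(math.sin(theta)) / L
        inv_rb = abs(math.tan(theta)) / L
        inv_rcm = 1.0 / math.sqrt((L - C) ** 2 + L ** 2 / math.tan(theta) ** 2)
        sign_theta = math.copysign(1.0, theta)
    centrifugal = sign_theta * V ** 2 * (
        M_D * R * inv_rf + M_D * R * inv_rb + M * H * inv_rcm)
    omega_ddot = (M * H * G * math.sin(phi)
                  - math.cos(phi) * (I_DC * SIGMA_DOT * theta_dot + centrifugal)
                  ) / I_BC
    theta_ddot = (torque - I_DV * SIGMA_DOT * omega_dot) / I_DL
    return omega_ddot, theta_ddot
```

In the equilibrium state with no displacement and no torque both accelerations vanish, while any tilt of the centre of mass produces an angular acceleration in the direction of the tilt, which is what the agent has to counteract.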
Abstract

In  this  paper,  we  consider  learning  first-order 
Horn  programs  from  entailment.  In  particu¬ 
lar,  we  show  that  any  subclass  of  first-order 
acyclic  Horn  programs  with  constant  arity  is 
exactly  learnable  from  equivalence  and  en¬ 
tailment  membership  queries  provided  it  al¬ 
lows  a  polynomial-time  subsumption  proce¬ 
dure  and  satisfies  some  closure  conditions. 

One  consequence  of  this  is  that  first-order 
acyclic  determinate  Horn  programs  with  con¬ 
stant  arity  are  exactly  learnable  from  equiv¬ 
alence  and  entailment  membership  queries. 

1  Introduction 

Learning  first-order  Horn  programs — sets  of  first-order 
Horn  clauses — is  an  important  problem  in  inductive 
logic  programming  with  applications  ranging  from 
speedup  learning  to  grammatical  inference. 

We  are  interested  in  speedup  learning,  which  concerns 
learning  domain-specific  control  knowledge  to  allevi¬ 
ate  the  computational  hardness  of  planning.  One  kind 
of  control  knowledge,  which  is  particularly  useful  in 
many  domains,  is  represented  as  goal-decomposition 
rules.  Each  decomposition  rule  specifies  how  a  goal 
can  be  decomposed  into  a  sequence  of  subgoals,  given 
that  a  set  of  conditions  is  true  in  the  initial  problem 
state.  Each  of  the  subgoals  might  in  turn  have  a  set 
of  decomposition  rules,  unless  it  is  a  primitive  action, 
in  which  case  it  can  be  directly  executed. 

Unlike  in  logical  inference,  for  which  Horn  clauses  are 
ideally  suited,  in  planning,  one  needs  to  keep  track  of 

*This paper also appears in the proceedings of the 8th
International Conference on Inductive Logic Programming,
1998 (ILP-98).


time.  In  spite  of  this  difference,  goal-decomposition 
rules  can  be  represented  as  first-order  Horn  clauses  by 
adding  two  situation  variables  to  each  literal  to  in¬ 
dicate  the  time  interval  in  which  the  literal  is  true. 
Hence,  the  problem  of  learning  goal-decomposition 
rules  for  a  single  goal  can  be  mapped  to  learning  first- 
order  Horn  definitions — a  set  of  Horn  clauses,  all  hav¬ 
ing  the  same  head  or  consequent  literal.  Learning  goal- 
decomposition  rules  for  multiple  goals  corresponds  to 
learning  first-order  Horn  programs.  Henceforth,  we 
omit  the  prefix  “first-order”,  except  when  there  is  a 
possibility  of  ambiguity. 

In learning from entailment, a positive (negative) example is a Horn
clause that is implied (not implied) by the target. Results by Cohen
(1995a, 1995b), Dzeroski et al. (1992) and others indicate that classes of
Horn programs having a single or a constant number of clauses are learnable
from examples. Khardon shows that “action strategies” consisting of a
variable number of constant-size first-order production rules can be
learned from examples (Khardon, 1996). However, Cohen (1995a) proves that
even predicting very restricted classes of Horn programs (viz.
function-free 0-depth determinate constant arity) with a variable number of
clauses of variable size from examples alone is cryptographically hard.

Frazier and Pitt (1993) first used the entailment setting for learning
arbitrary propositional Horn programs. In addition to examples, they also
used entailment membership queries (“entailment queries” from now on),
which ask if a Horn clause is entailed by the target. Moving to first-order
representations, Frazier and Pitt (1993) showed that Classic sentences are
exactly learnable in polynomial time from examples and entailment queries.
A Horn clause is simple if the terms and the variables in the body of the
clause are restricted to the terms that appear in the head. Page (1993)
considered non-recursive Horn programs restricted to simple clauses and
predicates of constant arity, and showed that they are learnable from
examples and entailment queries. Arimura (1997) generalized Page’s result
to acyclic (possibly recursive) simple Horn programs with constant-arity
predicates. Reddy and Tadepalli (1997b) showed that function-free
non-recursive Horn definitions are learnable from examples and entailment
queries. The result we present here applies to non-generative Horn
programs, where the variables and the terms in the head are restricted to
those in the body. We show that acyclic non-generative Horn programs with
constant arity that have a polynomial-time subsumption procedure are
learnable from examples and entailment queries when certain closure
conditions are satisfied. In particular, the result applies to acyclic Horn
programs with constant-arity determinate clauses.

Goal-decomposition rules are hierarchical in nature, as are Horn
programs. One aspect of learning in hierarchical domains is the
hierarchical order of literals (goals or concepts). In many systems,
learning hierarchically organized knowledge assumes that the structure of
the hierarchy or the order of the literals is known to the learner.
Examples of such work include Marvin (Sammut & Banerji, 1986) and XLearn
(Reddy & Tadepalli, 1997a), on the experimental side; learning from
exercises by Natarajan (1989) and learning acyclic Horn sentences by
Arimura (1997), on the theoretical side. In fact, Khardon shows that
learning hierarchical strategies can be computationally hard when the
structure of the hierarchy is not known (Khardon, 1996). Our algorithm also
assumes that the hierarchical order of the literals is known.

The rest of the paper is organized as follows. Section 2 provides
definitions for some of the terminology we use. Section 3 describes the
learning model and the learning algorithm, and proves the learnability
result. Section 4 concludes the paper with some discussion on implications
and limitations of the work.

2  Preliminaries 

In this section, we define and describe some of the terminology we use in
the rest of the paper. For brevity, we omit some of the standard
terminology (as given in books such as (Lloyd, 1987)). In the following, we
use $p$ and its variants, and $a$ and its variants, each to stand for a
conjunction of literals; and $b$, $q$, $l$ and their variants each to stand
for a single literal.

Definition 1 A definite Horn clause (Horn clause or clause, for short) is
a finite set of literals that contains exactly one positive literal:
$\{l, \neg l_1, \neg l_2, \ldots, \neg l_n\}$.

It is treated as a disjunction of the literals in the set with universal
quantification over all the variables. Alternately, it is represented as
$l_1, l_2, \ldots, l_n \rightarrow l$, where $l$ is called the head or
consequent, and $l_1, l_2, \ldots, l_n$ is called the body or antecedent and
is interpreted as $l_1 \wedge l_2 \wedge \ldots \wedge l_n$. A unit Horn
clause is a Horn clause with no negative literals and hence no body. A Horn
program or Horn sentence is a set of definite Horn clauses interpreted
conjunctively.

Definition 2 Let $C_1$ and $C_2$ be sets of literals. We say that $C_1$
subsumes $C_2$ (denoted $C_1 \succeq C_2$) iff there exists a substitution
$\theta$ such that $C_1\theta \subseteq C_2$. We also say $C_1$ is a
generalization of $C_2$.

Definition 3 (Plotkin, 1970) Let $C$, $C'$, $C_1$ and $C_2$ be sets of
literals. We say that $C$ is the least general generalization (lgg) of
$C_1$ and $C_2$ iff $C \succeq C_1$ and $C \succeq C_2$, and
$C' \succeq C$, for any $C'$ such that $C' \succeq C_1$ and
$C' \succeq C_2$.

Definition 4 (Plotkin, 1970) A selection of clauses $C_1$ and $C_2$ is a
pair of literals $(l_1, l_2)$ such that $l_1 \in C_1$ and $l_2 \in C_2$, and
$l_1$ and $l_2$ have the same predicate symbol, arity, and sign.

If $C_1$ and $C_2$ are sets of literals, then $lgg(C_1, C_2)$ is
$\{lgg(l_1, l_2) : (l_1, l_2)$ is a selection of $C_1$ and $C_2\}$. If $l$
is a predicate, $lgg(l(s_1, s_2, \ldots, s_n), l(t_1, t_2, \ldots, t_n))$ is
$l(lgg(s_1, t_1), \ldots, lgg(s_n, t_n))$. The lgg of two terms
$f(s_1, \ldots, s_n)$ and $g(t_1, \ldots, t_m)$, if $f = g$ and $n = m$, is
$f(lgg(s_1, t_1), \ldots, lgg(s_n, t_n))$; else, it is a variable $x$, where
$x$ stands for the lgg of that pair of terms throughout the computation of
the lgg of the set of literals.

As an example, let $C_1$ be $l(a,b), l(b,c), m(b) \rightarrow l(a,c)$, and
$C_2$ be $l(1,2), l(2,3), m(2) \rightarrow l(1,3)$. $(l(a,c), l(1,3))$ and
$(\neg m(b), \neg m(2))$ are two of the selections of $C_1$ and $C_2$.
$lgg(C_1, C_2)$ is
$l(x,y), l(y,z), l(t,u), l(v,w), m(y) \rightarrow l(x,z)$, where
$x, y, z, t, u, v$ and $w$ are variables standing for the pairs
$(a,1), (b,2), (c,3), (a,2), (b,3), (b,1)$ and $(c,2)$.
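The lgg computation above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the tuple encoding of literals, the function names `lgg_term` and `lgg_clause`, and the generated variable names `V0`, `V1`, ... are all assumptions made for the example.

```python
def lgg_term(s, t, table):
    """lgg of two terms. A term is a constant or a tuple (functor, arg1, ..., argn).

    `table` maps each mismatching pair of terms to one shared variable, so the
    same pair is generalized by the same variable throughout the clause.
    Assumption: no constant is named like a generated variable ("V0", ...).
    """
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        return (s[0],) + tuple(lgg_term(a, b, table)
                               for a, b in zip(s[1:], t[1:]))
    if (s, t) not in table:
        table[(s, t)] = f"V{len(table)}"
    return table[(s, t)]


def lgg_clause(c1, c2):
    """lgg of two clauses, each a list of literals (sign, predicate, args).

    sign is True for the head (positive literal) and False for body literals.
    Only selections (same sign, predicate and arity) are generalized.
    """
    table = {}
    result = []
    for sign1, pred1, args1 in c1:
        for sign2, pred2, args2 in c2:
            if sign1 == sign2 and pred1 == pred2 and len(args1) == len(args2):
                args = tuple(lgg_term(s, t, table)
                             for s, t in zip(args1, args2))
                literal = (sign1, pred1, args)
                if literal not in result:
                    result.append(literal)
    return result, table
```

Applied to the clauses $C_1$ and $C_2$ of the example, the sketch yields four body literals for $l$, one for $m$, and the head, with one variable for each of the seven pairs listed in the text.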

Definition 5 A derivation of a Horn clause $p \rightarrow q$ from a Horn
program $H$ is a finite directed acyclic graph $G$ such that there is a
node $q$, there is no arc $(q, r)$ in $G$, and for each node $l$ in $G$,
either $l \in p$ or if $(l_1, l), \ldots, (l_d, l)$ are the only arcs of $G$
terminating at $l$, then $l_1, \ldots, l_d \rightarrow l = C\theta$ for some
clause $C \in H$ and a substitution $\theta$.

For example, let $H$ be $\{parent(x,y), parent(y,z) \rightarrow
grandParent(x,z);\ mother(x,y) \rightarrow parent(x,y)\}$. Figure 1 shows a
derivation of $mother(a,b), mother(b,c) \rightarrow grandParent(a,c)$.

Proposition 1 In a derivation $G$ of a clause $p \rightarrow q$ from a
Horn program $H$, for any node $l$, either $l$ is in $p$ or
$H \models p \rightarrow l$.

              grandParent(a,c)

      parent(a,b)        parent(b,c)

      mother(a,b)        mother(b,c)

Figure 1: A derivation of $mother(a,b), mother(b,c) \rightarrow
grandParent(a,c)$ from $H$.
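The derivation in Figure 1 can be reproduced by forward chaining from the antecedent literals. The following is a minimal sketch under stated assumptions: atoms are tuples, variables are strings beginning with "?", and the program is function-free; the names `match`, `all_matches` and `forward_chain` are hypothetical and not taken from the paper.

```python
def match(pattern, fact, theta):
    """Extend substitution theta so that pattern becomes the ground fact.

    Returns the extended substitution, or None if no extension exists.
    Atoms are tuples (predicate, arg1, ..., argn) with string arguments;
    arguments starting with "?" are variables (an encoding assumption).
    """
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    theta = dict(theta)
    for p, f in zip(pattern[1:], fact[1:]):
        if p.startswith("?"):
            if theta.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return theta


def all_matches(body, facts, theta):
    """Yield every substitution that maps all body atoms into facts."""
    if not body:
        yield theta
        return
    for fact in facts:
        extended = match(body[0], fact, theta)
        if extended is not None:
            yield from all_matches(body[1:], facts, extended)


def forward_chain(program, facts):
    """Return all ground atoms derivable from facts using the program.

    program is a list of (body, head) pairs; every head variable must also
    occur in the body (non-generative clauses), so derived atoms are ground.
    """
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in program:
            # Materialize the matches before adding new atoms to closure.
            for theta in list(all_matches(body, closure, {})):
                new = (head[0],) + tuple(theta.get(a, a) for a in head[1:])
                if new not in closure:
                    closure.add(new)
                    changed = True
    return closure
```

Starting from mother(a,b) and mother(b,c), the sketch first adds the two parent atoms and then grandParent(a,c), which mirrors the levels of the derivation graph from the leaves toward the goal.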

Let $P$ be a set of predicate symbols, and $T$ be a set of terms. Let $L$
be a set of atoms defined using $P$ and $T$. Let $\mathcal{H}$ be a set of
Horn programs using atoms in $L$ only. If $k$ is an integer, then $P_k$ is
a subset of $P$ containing only those predicate symbols of arity $k$ or
less. Further, $L_k$ is a set of atoms defined using $P_k$ and $T$, and
$\mathcal{H}_k$ is a set of Horn programs using atoms in $L_k$ only. In the
following three definitions, we describe a class of Horn programs $AH_k$
for which minimal models are of polynomial size.

Definition 6 (Arimura, 1997) Let $\Sigma \in \mathcal{H}$. Then a binary
relation supported by (denoted $\succ$) over atoms in $L$ w.r.t. $\Sigma$
is such that (1) for all $p \rightarrow l \in \Sigma$, and for all
$l_i \in p$, $l \succ l_i$; (2) for all $l_1, l_2 \in L$ and every
substitution $\theta$, if $l_1 \succ l_2$, then
$l_1\theta \succ l_2\theta$; and (3) if $l_1 \succ l_2$ and
$l_2 \succ l_3$ then $l_1 \succ l_3$.

Definition 7 A Horn program $\Sigma$ is acyclic over $L$ if the relation
$\succ$ over $L$ w.r.t. $\Sigma$ is terminating; i.e., for any $l \in L$,
there is no infinite decreasing sequence
$l \succ l_1 \succ l_2 \succ \ldots$.

In the last example, $H$ is acyclic because
$grandParent(x,y) \succ parent(x,y) \succ mother(x,y)$ and there is no
cycle formed by the $\succ$ relation.
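A simple sufficient (but not necessary) test for acyclicity is to check that the predicate symbols themselves do not depend on each other cyclically. The sketch below is an assumption-laden illustration, not part of the paper: it ignores arguments entirely, so it rejects some programs that are acyclic in the sense of Definition 7 (for instance, recursive programs whose arguments strictly decrease). The name `predicates_acyclic` is hypothetical.

```python
def predicates_acyclic(program):
    """Check that the head-to-body predicate dependency graph has no cycle.

    program is a list of (body, head) pairs of atoms, where an atom is a
    tuple (predicate, arg1, ..., argn). If this returns True, every chain
    of the supported-by relation strictly descends through the predicates,
    so the program is acyclic; False is inconclusive for Definition 7.
    """
    graph = {}
    for body, head in program:
        graph.setdefault(head[0], set()).update(b[0] for b in body)

    state = {}  # predicate -> "open" (on the DFS stack) or "done"

    def visit(pred):
        if state.get(pred) == "done":
            return True
        if state.get(pred) == "open":
            return False  # reached a predicate that is still on the stack
        state[pred] = "open"
        ok = all(visit(nxt) for nxt in graph.get(pred, ()))
        state[pred] = "done"
        return ok

    return all(visit(pred) for pred in list(graph))
```

For the program $H$ of the last example the check succeeds, since grandParent depends only on parent and parent only on mother.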

Following Khardon (1998), we call a definite clause a non-generative
clause if the set of terms in its consequent is a subset of the set of
terms and subterms in its antecedent.

Definition 8 If $k$ is a constant, we define a Horn program
$\Sigma \in \mathcal{H}_k$ to be in the class $AH_k$, if $\Sigma$ is
acyclic over $L_k$, and each clause is either non-generative or has an
empty antecedent.

Definition 9 Let $a \rightarrow b$ be a clause in a Horn program $\Sigma$,
and $p \rightarrow q$ be a clause. Then, $a \rightarrow b$ is a target
clause in $\Sigma$ of $p \rightarrow q$ iff
$a \rightarrow b \succeq p \rightarrow q$, i.e., for a substitution
$\theta$, $a\theta \subseteq p$, $b\theta = q$. We call $p \rightarrow q$ a
hypothesis clause of $a \rightarrow b$.
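Whether a clause is a target clause of another reduces to a subsumption test. The following backtracking sketch is only illustrative: it assumes literals are tuples whose first component is the sign (True for the head), that variables are strings beginning with "?", and that the second clause is treated as fixed (its symbols are never instantiated). The names `match_term` and `subsumes` are hypothetical, and the search is exponential in the worst case, unlike the polynomial-time procedures the paper assumes for its restricted class.

```python
def match_term(pat, term, theta):
    """One-way matching: extend theta so that pat instantiates to term.

    Returns the extended substitution or None. Nested tuples are matched
    component by component, so compound terms such as ("f", "?x") work.
    """
    if isinstance(pat, str) and pat.startswith("?"):
        if pat in theta:
            return theta if theta[pat] == term else None
        return {**theta, pat: term}
    if (isinstance(pat, tuple) and isinstance(term, tuple)
            and len(pat) == len(term)):
        for a, b in zip(pat, term):
            theta = match_term(a, b, theta)
            if theta is None:
                return None
        return theta
    return theta if pat == term else None


def subsumes(c1, c2, theta=None):
    """True iff some substitution maps every literal of c1 into c2."""
    theta = {} if theta is None else theta
    if not c1:
        return True
    first, rest = c1[0], c1[1:]
    for literal in c2:
        extended = match_term(first, literal, theta)
        if extended is not None and subsumes(rest, c2, extended):
            return True
    return False
```

With the grandParent clause of the earlier example as $a \rightarrow b$, the test accepts a ground clause whose antecedent contains parent(a,b) and parent(b,c) and rejects one that is missing parent(b,c).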

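Definition 10 For an antecedent $p$, $q'$ is a prime consequent of $p$ wrt
$\Sigma$ if $\Sigma \models p \rightarrow q'$, $q' \notin p$, and there is
no $l \in L$ such that $q' \succ l$, $\Sigma \models p \rightarrow l$ and
$l \notin p$.

In the last example, $parent(a,b)$ is a prime consequent of
$mother(a,b), mother(b,c)$, but $grandParent(a,c)$ is not, since
$grandParent(a,c) \succ parent(a,b)$ and $parent(a,b)$ is entailed by the
antecedent without being in it.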
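The search for a prime consequent can be illustrated on the ground instances of the example. This sketch makes several assumptions that are not in the paper: atoms are plain strings, the program is already ground, and the program itself stands in for the entailment and derivation-order oracles. The names `closure`, `supported_by` and `prime_cons` are hypothetical.

```python
def closure(program, facts):
    """All atoms entailed by facts under a ground program [(body, head)]."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in program:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


def supported_by(program):
    """Transitive closure of the relation head > body-literal."""
    rel = {(head, b) for body, head in program for b in body}
    changed = True
    while changed:
        changed = False
        for x, y in list(rel):
            for y2, z in list(rel):
                if y == y2 and (x, z) not in rel:
                    rel.add((x, z))
                    changed = True
    return rel


def prime_cons(program, p, q):
    """Walk down the supported-by chain from q to a prime consequent of p.

    Assumes q is entailed by p, q is not in p, and the program is acyclic,
    so the walk terminates.
    """
    rel = supported_by(program)
    candidates = closure(program, p) - set(p)   # entailed and not in p
    q_prime = q
    while True:
        lower = sorted(l for l in candidates if (q_prime, l) in rel)
        if not lower:
            return q_prime
        q_prime = lower[0]  # any such literal would do; sorted for determinism
```

On the grandParent example, starting from grandParent(a,c) the walk moves to a parent atom and stops there, because the mother atoms that support it are already in the antecedent.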

3  Learning  Horn  Programs 

In this section, we show that a subclass of $AH_k$ is exactly learnable,
using the exact learning model (Angluin, 1988), in the entailment setting.
Henceforth, $\Sigma \in AH_k$ denotes a target Horn program.

3.1  The  Learning  Model 

In learning from entailment, an example is a Horn clause. An example
$p \rightarrow q$ is a positive example of $\Sigma$ if
$\Sigma \models p \rightarrow q$; negative, otherwise. An entailment query
takes as input an example $(p \rightarrow q)$, and outputs yes if it is a
positive example of $\Sigma$ ($\Sigma \models p \rightarrow q$), and no
otherwise. An equivalence query takes as input a Horn program $H$ and
outputs yes if $H$ and $\Sigma$ contain (entail) exactly the same Horn
clauses; otherwise, it returns a counterexample that is in (entailed by)
exactly one of $H$ and $\Sigma$. A derivation-order query takes as input
two atoms $l_1$ and $l_2$ in $L$ and outputs yes if $l_1 \succ l_2$, and no
otherwise. An algorithm exactly learns a Horn program $\Sigma$ in $AH_k$ in
polynomial time from equivalence, entailment, and derivation-order
($\succ$) queries if and only if it runs in time polynomial in the size of
$\Sigma$ and in the size of the largest counterexample, and outputs a Horn
program in $AH_k$ such that the equivalence query answers yes.

3.2  The  Learning  Algorithm 

In this section, we describe the learning algorithm, PLearn, shown in
Figure 2. PLearn always maintains a hypothesis $H$ which is entailed by the
target, so that every instance of $H$ is also an instance of $\Sigma$ and
all counterexamples are positive.

Suppose that a counterexample $p \rightarrow q$ is given to the learner
(see Figure 2). Every such counterexample has a derivation from the target
theory, $\Sigma$. Since this derivation is not possible from the current
hypothesis $H$, there is some clause used in the derivation that has not
been learned with sufficient generality. The algorithm tries to identify
the antecedent literals of such a clause, $c^*$, in the target by expanding
the derivation graph from its leaves in $p$ toward the goal using the
clauses in $H$. In other words, PLearn computes the minimal model ($p'_f$)
of $H$ implied by $p$ (“closure” or “saturation”) by forward chaining
(line 4). To identify the consequent of $c^*$, also called the “prime
consequent” of $p'_f$, PLearn calls PrimeCons in line 5. PrimeCons finds
the prime consequent of $p'_f$ by tracing the “supported-by” chain starting
from $q$ for a literal $q_f$ that is not in $p'_f$ but is directly
supported by some of the literals in $p'_f$ (lines 13-18). In line 6,
PLearn makes use of Reduce to trim away “irrelevant” literals from the
antecedent $p'_f$ to form a new clause $p_f \rightarrow q_f$ that is also a
counterexample to the hypothesis and is subsumed by a single target clause
(see Lemmas 9, 2 and 3). PLearn combines $p_f \rightarrow q_f$ with an
“appropriate” clause $p_i \rightarrow q_i$ in $H$ using lgg (lines 7-9). It
uses the entailment query to find an appropriate hypothesis clause by
checking if the result of lgg is implied by the target (line 7). If no such
clause exists in $H$, $p_f \rightarrow q_f$ is appended to $H$ as a new
clause (line 10).

PLearn
Given equivalence, entailment and $\succ$ queries,
outputs a Horn program $H$ s.t. equivalent?($H$, $\Sigma$) is Yes.

(1)  $H = \{\}$   /* empty hypothesis-clauses set */
(2)  while not equivalent?($H$, $\Sigma$) do {
(3)    Let $p \rightarrow q$ be the counterexample returned
(4)    $p'_f = \{l : H \models (p \rightarrow l)\}$   /* forward chaining */
(5)    $q_f$ = PrimeCons($p'_f \rightarrow q$)
(6)    $p_f \rightarrow q_f$ = Reduce($p'_f \rightarrow q_f$)
(7)    if $\exists\, p_i \rightarrow q_i \in H$ such that $\Sigma \models p_g \rightarrow q_g$,
(8)       where $p_g \rightarrow q_g = lgg(p_i \rightarrow q_i, p_f \rightarrow q_f)$
(9)    then replace first such $p_i \rightarrow q_i$ by Reduce($p_g \rightarrow q_g$)
(10)   else append $p_f \rightarrow q_f$ to $H$
(11) }   /* while */
(12) return $H$

PrimeCons($p \rightarrow q$)   /* finds prime consequents */
(13) Let $L$ be the set of all possible literals having
     only those terms that are in $p$
(14) $q' = q$;
(15) $L' = \{l : l \in L - p$ and $\Sigma \models p \rightarrow l\}$
(16) while $\exists\, l \in L'$ such that $q' \succ l$
(17)   $q' = l$;
(18) return $q'$

Reduce($p \rightarrow q$)   /* trims irrelevant literals */
(19) $p' = p$
(20) repeat
(21)   for each literal $l$ in $p'$ in sequence do
(22)     if $\Sigma \not\models (p' - \{l\}) \rightarrow l$ and $\Sigma \models (p' - \{l\}) \rightarrow q$
(23)     then $p' = p' - \{l\}$
(24) until there is no change to $p'$
(25) return $p' \rightarrow q$

Figure 2: PLearn Algorithm

One problem with this approach is that the size of the lgg is a product of
the sizes of its two arguments. This causes the size of a hypothesis clause
to grow exponentially in the number of examples combined with it in the
worst case. To avoid this, the antecedent literals of the clause after lgg
are again trimmed using Reduce so that the size of the resulting clause is
bounded, while it is still subsumed by the target clause (lines 19-25). The
result of Reduce then replaces the original hypothesis clause
$p_i \rightarrow q_i$ it is derived from (line 9). After this step, only
the antecedents of the target clause and some of their consequents remain
in the resulting hypothesis clause (see Lemma 5). This process repeats
until the hypothesis $H$ is equivalent to $\Sigma$. The algorithm works for
unit clauses (which have empty antecedents) without change.

3.3  An  Example 

As an example to see how PLearn works, consider
$\Sigma = \{l_1(f(x)), l_2(x), l_3(x) \rightarrow l_4(x);\
l_1(f(x)), l_2(x) \rightarrow l_5(x);\
l_4(x), l_5(x) \rightarrow l_7(x)\}$ where $f$ is a function symbol.
Suppose $H = \{l_1(f(c)), l_2(c) \rightarrow l_5(c)\}$. We adopt the
convention that letters at the beginning of the alphabet such as
$a, b, c$ are constants and letters at the end of the alphabet such as
$x, y, z$ are variables. Let the counterexample be
$l_1(f(d)), l_2(d), l_3(d) \rightarrow l_7(d)$. In step 4, it does not
change. In PrimeCons, since $l_7(d) \succ l_4(d)$ and
$l_7(d) \succ l_5(d)$, $l_7(d)$ is not a prime consequent, but either one
of $l_4(d)$ and $l_5(d)$ is. Suppose PrimeCons returns $l_5(d)$. Reduce
eliminates $l_3(d)$ from the antecedent, because
$\Sigma \not\models l_1(f(d)), l_2(d) \rightarrow l_3(d)$, and
$\Sigma \models l_1(f(d)), l_2(d) \rightarrow l_5(d)$. Thus,
$p_f \rightarrow q_f = l_1(f(d)), l_2(d) \rightarrow l_5(d)$. Combining
this with the clause in $H$, we obtain
$p_g \rightarrow q_g = l_1(f(x)), l_2(x) \rightarrow l_5(x)$. Since this is
entailed by $\Sigma$, the new $H$ is
$\{l_1(f(x)), l_2(x) \rightarrow l_5(x)\}$.

Suppose the next counterexample is
$l_1(f(c)), l_2(c), l_3(c) \rightarrow l_7(c)$. Then, $q_f = l_4(c)$, and
$p'_f = \{l_1(f(c)), l_2(c), l_3(c), l_5(c)\}$.
$p_f \rightarrow q_f = l_1(f(c)), l_2(c), l_3(c), l_5(c) \rightarrow
l_4(c)$, since Reduce cannot remove $l_5(c)$, because it is implied by the
other literals wrt $\Sigma$ (line 22). The modified counterexample
$p_f \rightarrow q_f$ cannot be combined with the clause in $H$, because
the resultant $p_g \rightarrow q_g$ after lgg,
$l_1(f(x)), l_2(x) \rightarrow$, is not entailed by $\Sigma$. Hence, it is
appended to $H$ to make
$H = \{l_1(f(x)), l_2(x) \rightarrow l_5(x);\
l_1(f(c)), l_2(c), l_3(c), l_5(c) \rightarrow l_4(c)\}$.

Suppose the next counterexample is again
$l_1(f(c)), l_2(c), l_3(c) \rightarrow l_7(c)$. After line 4,
$p'_f = l_1(f(c)), l_2(c), l_3(c), l_5(c), l_4(c)$. $q_f$ now is
$l_7(c)$, because it is a prime consequent of $p'_f$. After Reduce,
$p_f = l_4(c), l_5(c)$. $p_f \rightarrow q_f$ cannot be combined with the
clauses in $H$, because the resultant lgg’s are not entailed by $\Sigma$.
Again, $p_f \rightarrow q_f$ is added to $H$. This process continues until
$H$ and $\Sigma$ are equivalent.

To bring out the nuances in Reduce, let us revisit the last part of the
previous example. Consider the input
$l_5(c), l_2(c), l_3(c), l_1(f(c)), l_4(c) \rightarrow l_7(c)$ to Reduce.
Although $\Sigma \models l_1(f(c)), l_2(c), l_3(c) \rightarrow l_7(c)$,
since $\Sigma \models l_1(f(c)), l_2(c), l_3(c) \rightarrow l_4(c)$ and
$\Sigma \models l_1(f(c)), l_2(c) \rightarrow l_5(c)$, the literal
$l_5(c)$ cannot be removed. This is because $l_5(c)$ is implied by the
other literals ($l_1(f(c)), l_2(c)$) wrt $\Sigma$. The order in which the
literals are removed in Reduce follows the derivation order: if
$l_i \succ l_j$, then $l_i$, if it is removed at all, is removed after
$l_j$ is removed. This can be intuitively imagined in the following way.
Consider a derivation tree for a counterexample, with the consequent
literal on top and the antecedent literals at the bottom. The above process
trims off the literals bottom-up in the tree up to the appropriate level,
so that the resulting clause is subsumed by some clause in the target. In
the above case, if Reduce removes $l_5(c)$ and leaves over
$l_1(f(c)), l_2(c)$, the resulting clause
($l_1(f(c)), l_2(c), l_3(c), l_4(c) \rightarrow l_7(c)$) is not subsumed by
any clause in $\Sigma$.

However, this means that Reduce leaves over literals which are implied by
the remaining literals, i.e., $l$ cannot be removed from $p'$ if
$\Sigma \models (p' - \{l\}) \rightarrow l$ (line 22). Removing such
literals could result in hypothesis clauses which are not subsumed by any
target clause, as the following example illustrates. Let $\Sigma$ be
$\{l_1(a) \rightarrow l_2(a);\ l_1(x), l_2(x) \rightarrow l_3(x)\}$.
Suppose the first counterexample is
$l_1(a), l_2(a) \rightarrow l_3(a)$. Hence $p'_f = \{l_1(a), l_2(a)\}$ and
$q_f = l_3(a)$ in line 6. If Reduce were to remove $l_2(a)$ from $p'_f$
because $\Sigma \models l_1(a) \rightarrow l_3(a)$, it ends up with a
clause that is not subsumed by any target clause. We would like to prevent
such redundant hypothesis clauses so that their number is not too high
compared to the number of target clauses. (This argument is formalized in
Lemmas 6, 7 and 8.)
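The effect of the two conditions in line 22 can be checked on the ground instances of the two examples above. This is a hedged sketch, not the authors' code: atoms are plain strings, the ground program stands in for the entailment oracle, and the flag `check_implied` is a hypothetical switch that turns off the first condition of line 22 to show what would go wrong without it.

```python
def closure(program, facts):
    """All atoms entailed by facts under a ground program [(body, head)]."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in program:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


def reduce_clause(program, p, q, check_implied=True):
    """Trim antecedent p of the clause p -> q following lines 19-25.

    A literal l is dropped only if the remaining literals still entail q
    and (when check_implied is True) do not entail l itself.
    """
    current = list(p)
    changed = True
    while changed:                      # repeat ... until no change
        changed = False
        for l in list(current):         # each literal in sequence
            rest = [x for x in current if x != l]
            known = closure(program, rest)
            implied = l in known
            if q in known and not (check_implied and implied):
                current = rest
                changed = True
    return current
```

On the first example the sketch keeps exactly $l_4(c)$ and $l_5(c)$, as stated in the text; on the second it keeps both $l_1(a)$ and $l_2(a)$, whereas with the first condition switched off it would return $l_1(a) \rightarrow l_3(a)$, which no target clause subsumes.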

3.4  Learnability  of  AHk 

In this section, we prove that the PLearn algorithm in Figure 2 exactly
learns a subclass of $AH_k$ for which subsumption is of polynomial-time
complexity. The plan of the proof is as follows: Through a series of
lemmas, we first establish that every hypothesis clause learned has a
target clause (Lemma 6). We then show that every target clause has at most
one hypothesis clause (Lemma 8). Together, these two lemmas establish that
the number of hypothesis clauses is bounded by the number of target
clauses. We use this fact and the bounds on the sizes of the hypothesis
clauses (established in Lemma 5) to show that PLearn learns successfully in
polynomial time (Theorems 10 and 11). We then define a specific hypothesis
class that obeys the conditions of these theorems and prove that this class
is learnable (Theorem 12).

Lemmas 2 and 3 show that PrimeCons with the input $p \rightarrow q$ finds
a (prime) consequent $q'$ of $p$ such that $p \rightarrow q'$ is subsumed
by a clause in $\Sigma$.

Lemma 2 Let $p \rightarrow q$ be the input and $q'$ be the output of
PrimeCons. Assume that $q \notin p$ and $\Sigma \models p \rightarrow q$.
Then, (1) PrimeCons terminates; (2) $q'$ is a prime consequent of $p$ wrt
$\Sigma$.

Proof. (1) Since $\Sigma$ is acyclic, there is a terminating sequence
$q \succ l_1 \succ l_2 \succ \ldots$. Since the loop of lines 16-17 can
only iterate as many times as the length of the sequence, PrimeCons
terminates.

(2) $q'$ is such that $\Sigma \models p \rightarrow q'$, and
$q' \notin p$ (by lines 15-17). Since $q'$ is as in line 17 in the
iteration immediately prior to the terminating iteration of lines 16-17,
there is no $l$ such that $q' \succ l$, $\Sigma \models p \rightarrow l$
and $l \notin p$. Thus, $q'$ is a prime consequent of $p$ wrt $\Sigma$. □

Lemma 3 If $q'$ is a prime consequent of $p$ wrt $\Sigma$, then there is a
clause $C \in \Sigma$ such that $C \succeq p \rightarrow q'$.

Proof. Assume that $q'$ is a prime consequent of $p$ wrt $\Sigma$.
Consider a derivation $G$ of $p \rightarrow q'$ in $\Sigma$. Let
$(l_1, q'), \ldots, (l_d, q')$ be the only arcs of $G$ that terminate at
$q'$. This implies that $q' \succ l_i$ for all
$l_i \in \{l_1, \ldots, l_d\}$. It must be that every $l_i$ is in $p$;
otherwise, there is an $l$ (viz. $l_i$) such that $q' \succ l$,
$\Sigma \models p \rightarrow l$ and $l \notin p$, contradicting the
assumption that $q'$ is a prime consequent. Thus,
$\{l_1, \ldots, l_d\} \subseteq p$. But,
$l_1, \ldots, l_d \rightarrow q' = C\theta$ for some clause
$C \in \Sigma$ and a substitution $\theta$, following the definition of
derivation. Thus, $C\theta \subseteq p \rightarrow q'$, implying that
$C \succeq p \rightarrow q'$. □

The following definition and Lemmas 4 and 5 help show that Reduce, given a
clause $p \rightarrow q$ as input, removes irrelevant literals from the
antecedent $p$, while maintaining $q$ as a consequent.

Definition 11 If $a$ is a conjunction, the closure of $a$ with respect to
$\Sigma$, denoted by $\kappa_a$, is defined as
$\{l \mid \Sigma \models (a \rightarrow l)\}$.

Lemma 4 If $q$ is a prime consequent of $p$ and
$p' \rightarrow q$ = Reduce($p \rightarrow q$), then $q$ is a prime
consequent of $p'$ also.

Proof. Because $q$ is a prime consequent of $p$ and $p' \subseteq p$, any
literal other than the ones in $p - p'$ cannot be a prime consequent of
$p'$. By lines 22-23, only those literals $l$ that are not supported by
$p'$ are removed. In which case, no literal $l$ in $p - p'$ can be such
that $\Sigma \models p' \rightarrow l$. Hence, $q$ is a prime consequent of
$p'$ as well. □

Lemma 5 If the input $p \rightarrow q$ to Reduce is s.t. $q$ is a prime
consequent of $p$ wrt $\Sigma$, then the output $p' \rightarrow q$ is such
that $p' \subseteq \kappa_{a\theta}$ where $a \rightarrow b$ is a clause in
$\Sigma$ and $a\theta \subseteq p'$ and $b\theta = q$.

Proof. Since $q$ is a prime consequent of $p$, by Lemma 4, $q$ is a prime
consequent of $p'$ also. Then, by Lemma 3, there is a clause
$a \rightarrow b \in \Sigma$, and a $\theta$ such that
$a\theta \subseteq p'$ and $b\theta = q$. We now show that
$p' \subseteq \kappa_{a\theta}$. Assume that there exists a literal in
$p' - \kappa_{a\theta}$. Let $l \in p' - \kappa_{a\theta}$ be a least such
literal so that there is no literal $l'$ in $p' - \kappa_{a\theta}$ such
that $l \succ l'$. Such a literal must exist, because $\Sigma$ is acyclic.
There are two reasons for $l$ to remain in $p' - \kappa_{a\theta}$: either
(a) $\Sigma \not\models (p' - \{l\}) \rightarrow q$ or
(b) $\Sigma \models (p' - \{l\}) \rightarrow l$. We disprove both the
cases:

(a) Since $a\theta \subseteq p'$, and $l$ is not in $\kappa_{a\theta}$ and
thus not in $a\theta$, $a\theta \subseteq (p' - \{l\})$. Therefore,
$\Sigma \models (p' - \{l\}) \rightarrow q$.

(b) The only other reason why $l$ remains in $p'$ is that
$\Sigma \models (p' - \{l\}) \rightarrow l$. That means that
$p' - \{l\}$ contains literals that imply $l$. There must be at least one
such literal in $p'$ that is not in $\kappa_{a\theta}$, or else
$l \in \kappa_{a\theta}$, contradicting $l \in p' - \kappa_{a\theta}$. But
then $p' - \kappa_{a\theta}$ contains literals $l'$ such that
$l \succ l'$, which contradicts the statement that there is no such $l'$.
Thus, we disprove both the possibilities. Hence,
$p' \subseteq \kappa_{a\theta}$. □

Lemmas 6, 7 and 8, below, show that PLearn only maintains right clauses
in $H$.

Lemma 6 Every clause $p_i \rightarrow q_i \in H$ has a target clause.

Proof. We first show that each $p_i \rightarrow q_i \in H$ is such that
$q_i$ is a prime consequent of $p_i$. Then, by Lemma 3,
$p_i \rightarrow q_i$ has a clause $C \in \Sigma$ such that
$C \succeq p_i \rightarrow q_i$.

We show that $q_i$ is a prime consequent of $p_i$ by induction on the
number of times a clause at position $i$ in $H$ is updated. It is first
introduced by line 10. By Lemmas 2 and 4, $q_f$ is a prime consequent of
$p_f$. This proves the base case. The other way a clause becomes a
hypothesis clause is by line 9. The clause at position $i$ in $H$
($p_i \rightarrow q_i$) is updated by line 9. As inductive hypothesis,
assume that each $p_i \rightarrow q_i$ in $H$ is such that $q_i$ is a prime
consequent of $p_i$, at the beginning of an iteration of the loop of lines
2-11 when position $i$ in $H$ is updated. Consider
$p_g \rightarrow q_g = lgg(p_i \rightarrow q_i, p_f \rightarrow q_f)$.
Suppose $q_g$ is not a prime consequent of $p_g$, but $q'_g$ such that
$q_g \succ q'_g$ is. Let $\theta_f$ and $\theta_i$ be substitutions such
that $p_g\theta_f \subseteq p_f$, $q_g\theta_f = q_f$,
$p_g\theta_i \subseteq p_i$, and $q_g\theta_i = q_i$. Let
$q'_f = q'_g\theta_f$ and $q'_i = q'_g\theta_i$. Since $q_g \succ q'_g$, by
the definition of the order, $q_f \succ q'_f$ and $q_i \succ q'_i$. Since
$q_f$ is a prime consequent of $p_f$, $q'_f$ must be in $p_f$. Similarly,
$q'_i$ must be in $p_i$. Therefore, $lgg(q'_i, q'_f) = q'_g$ must be in
$p_g$, contradicting the assumption that $q'_g$ is a prime consequent of
$p_g$. Hence, $q_g$ is a prime consequent of $p_g$. By Lemma 4, if
$p_i \rightarrow q_i$ = Reduce($p_g \rightarrow q_g$), then $q_i$ is a
prime consequent of $p_i$. So by Lemma 3, $p_i \rightarrow q_i$ has a
target clause. □

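Lemma 7 If PLearn combines a modified counterexample
$p_f \rightarrow q_f$ with a clause $p_i \rightarrow q_i \in H$, then there
is a target clause $C$ s.t. $C \succeq p_f \rightarrow q_f$ and
$C \succeq p_i \rightarrow q_i$. Further, there is no $C$ s.t.
$C \succeq p_j \rightarrow q_j$ and $C \succeq p_f \rightarrow q_f$, for
any $j < i$.

Proof. PLearn combines $p_f \rightarrow q_f$ with $p_i \rightarrow q_i$
only if $\Sigma \models lgg(p_i \rightarrow q_i, p_f \rightarrow q_f)$. By
Lemma 6, $q_g$ is a prime consequent of $p_g$ where
$p_g \rightarrow q_g = lgg(p_i \rightarrow q_i, p_f \rightarrow q_f)$. By
Lemma 3, there is a $C \in \Sigma$ such that
$C \succeq p_g \rightarrow q_g$. Hence, $C \succeq p_i \rightarrow q_i$ and
$C \succeq p_f \rightarrow q_f$.

Since $p_f \rightarrow q_f$ is combined with $p_i \rightarrow q_i$, for
any $j < i$,
$\Sigma \not\models lgg(p_j \rightarrow q_j, p_f \rightarrow q_f)$.
Therefore, there is no $C$ s.t.
$C \succeq lgg(p_j \rightarrow q_j, p_f \rightarrow q_f)$. Thus, there is
no $C$ s.t. $C \succeq p_j \rightarrow q_j$ and
$C \succeq p_f \rightarrow q_f$. □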

Lemma 8 Every clause $C \in \Sigma$ has at most one hypothesis clause.

Proof. First, we show that any new hypothesis clause added to $H$ has a
target clause distinct from the target clauses of the other hypothesis
clauses in $H$. Next, we show that if two hypothesis clauses do not have
common target clauses at the beginning of an iteration of the loop of lines
2-11, then they still have distinct target clauses at the end of the
iteration.

When $p_f \rightarrow q_f$ is added to $H$, by Lemma 7, for any clause
$H_i$ in $H$, there is no $C \in \Sigma$ such that $C \succeq H_i$ and
$C \succeq p_f \rightarrow q_f$. Therefore, $p_f \rightarrow q_f$, a new
clause added to $H$, has a target clause distinct from the target clauses
of the other hypothesis clauses then in $H$. Next, at most one of $H_i$ and
$H_j$ can change in an iteration of the loop. If neither changes, we are
done with the proof. Suppose that $H_i$ changes, without loss of
generality. Let $C$ be any target clause of $H_j$. Assume that $H_i$ and
$H_j$ do not have a common target clause at the beginning of an iteration.
Hence, $C$ is not a target clause of $H_i$. That is,
$C \not\succeq H_i$. Let $e$ be the counterexample for the current
iteration. We first show that $lgg(H_i, e)$ does not have $C$ as a target
clause. Since $C \not\succeq H_i$, $C \not\succeq lgg(H_i, e)$. Therefore,
$C$ is not a target clause of $lgg(H_i, e)$. Let $lgg(H_i, e)$ be
$p_g \rightarrow q_g$, and $C$ be $a \rightarrow b$. Hence, for every
$\theta$, either $a\theta \not\subseteq p_g$ or $b\theta \neq q_g$. If
$a\theta \not\subseteq p_g$, $a\theta$ is not a subset of any subset of
$p_g$. Since Reduce outputs a clause with a subset of $p_g$ as the
antecedent and $q_g$ as the consequent,
$C \not\succeq$ Reduce($lgg(H_i, e)$). Therefore, $H_j$ and the new clause
in position $i$, Reduce($lgg(H_i, e)$), do not have a common target clause
even at the end of the iteration. □

The following lemma shows that even after the modifications due to
PrimeCons and Reduce, a counterexample remains a counterexample.

Lemma 9 $p_f \rightarrow q_f$ as in line 6 of PLearn is a positive
counterexample.

Proof. First, we show that every counterexample $p \rightarrow q$, as in
line 3, is a positive counterexample. Then, we argue that
$p'_f \rightarrow q_f$ (lines 4 and 5) is also a positive counterexample.
Finally, we show that $p_f \rightarrow q_f$ (line 6) is a positive
counterexample.

Since, by Lemma 6, for every $H_i \in H$, there is a clause
$C \in \Sigma$ such that $C \succeq H_i$, $\Sigma \models H$. Therefore,
$p \rightarrow q$, as in line 3, is a positive counterexample. Since
$p \subseteq p'_f$, $\Sigma \models p'_f \rightarrow q$. Since $p'_f$
contains all and only those literals $l$ such that
$H \models p \rightarrow l$, for any literal $l' \notin p'_f$,
$H \not\models p'_f \rightarrow l'$. Since $q_f$ (by lines 5 and 15) is not
in $p'_f$, $H \not\models p'_f \rightarrow q_f$. By line 15,
$\Sigma \models p'_f \rightarrow q_f$. Therefore, $p'_f \rightarrow q_f$ is
also a positive counterexample. Finally, since $p_f \subseteq p'_f$,
$H \not\models p_f \rightarrow q_f$. By lines 6 and 22,
$\Sigma \models p_f \rightarrow q_f$. Thus, $p_f \rightarrow q_f$ is a
positive counterexample. □

Finally, Theorem 10 shows that PLearn exactly learns $AH_k$ when forward
chaining using $H$ is of polynomial-time complexity. Theorem 11 identifies
conditions on $\Sigma$ such that PLearn returns an $H$ for which the time
complexity of forward chaining is polynomial.

Theorem 10 PLearn exactly learns $AH_k$ with equivalence, $\succ$, and
entailment queries, provided determining $H \models p \rightarrow l$ is
polynomial in the sizes of $H$ and $p$.

Proof. By Lemma 9, $p_f \to q_f$ is a positive counterexample.
For each counterexample, either a new
antecedent is added (line 10) or an existing antecedent
is replaced (line 9). In the latter case, the replaced
clause $p_i \to q_i$ must be subsumed by the replacing
clause $p' \to q_g$, since both lgg and Reduce generalize
the original clause by turning constants to variables
and dropping literals. On the other hand, the replaced
clause must not subsume (and hence be different from)
the replacing clause $p' \to q_g = \mathit{Reduce}(p_g \to q_g)$. If
not, that is if $p_i \to q_i \succeq p' \to q_g$, then since
$p' \to q_g \models p_g \to q_g \models p_f \to q_f$, we have $p_i \to q_i \models p_f \to q_f$.
Since $p_i \to q_i \in H$, $H \models p_f \to q_f$, thus contradicting
that $p_f \to q_f$ was a counterexample of $H$. Hence, the
replacement at a position in $H$ changes the clause at that
position. The minimum change there can be is either a
variablization of a constant or a removal of a literal.

Let  n  be  the  number  of  clauses,  and  s  be  the  number 
of  distinct  predicate  symbols  in  E.  Further,  let  the 
maximum  number  of  terms  in  any  clause  be  t,  and  in 
any  counterexample  be  tg  ■ 

The  maximum  possible  number  of  literals  there  can 
be  using  t  terms  is  at  most  st*.  Hence,  the  maxi¬ 
mum  number  of  literals  in  Ka,  and  therefore,  by  Lem¬ 
mas  5  and  6,  in  each  clause  is  at  most  st*.  This  in¬ 
cludes  all  literals  and  their  variablized  versions.  Hence, 
we  can  consider  variablization  as  removing  a  literal. 
Thus,  we  need  at  most  st*  counterexamples  for  each 
clause.  (This  includes  one  base  counterexample  to  in¬ 
troduce  a  clause  into  H.)  By  Lemmas  6  and  8,  there 
are  at  most  n  clauses  in  H.  Hence,  we  need  at  most 
nst*  counterexamples  or  equivalence  queries.  A  call 
to  PrimeCons  from  line  5  takes  at  most  stg  entailment 
queries,  because  the  literals  we  need  to  try  as  possible 
consequents  are  all  in  L,  and  \L\  <  st'^.  PrimeCons  is 
called  once  for  each  of  the  counterexamples. 

For  each  of  the  nst'^  counterexamples,  the  condition 


in  line  7  is  tested  at  most  n  times,  which  needs  at 
most  n  entailment  queries.  Reduce  is  called  with  the 
argument  p'f  qf  once  for  each  of  the  counterexam¬ 
ples,  and  with  the  arguments  pg  ->  qg  for  at  most 
nst'^  counterexamples.  In  Reduce(p  ->  g),  in  |p|  iter¬ 
ations  of  the  loop  of  lines  21-23,  at  least  one  literal 
is  removed.  So,  this  loop  can  be  tried  at  most  |p| 
times.  Each  iteration  of  the  loop  of  lines  21-23  takes 
two  entailment  queries.  Therefore,  Reduce(p  ->  g) 
needs  at  most  |p|(|p|  -f  1)  entailment  queries.  Hence, 
Reduce(p^  qf)  needs  at  most  n/  =  st'^{st^  -f  1) 
entailment  queries.  Since  p;  g^  and  pf  ->  qf 
are  outputs  of  Reduce,  the  maximum  possible  num¬ 
ber  of  literals  in  pg  qg  =  lgg{pi  -t  gi,p/  ->  g/) 
is  at  most  Hence,  Reduce(pg  -t  qg)  needs  at 

most  Ug  =  -b  1)  entailment  queries.  Thus, 

the  total  number  of  entailment  queries  is  at  most 
ns<*(s<*  -b  n  -b  n/  -f  Ug). 

If  determining  H  \=  {p  1)  takes  V{n,l,tg)  time 
where  F*  is  a  polynomial,  then  line  4  takes  at  most 
st*  •  'P{n,l,tg)  time.  In  the  rest,  the  number  of  en¬ 
tailment  queries  dominates  the  time.  Hence,  the  time 
taken  by  PLearn  is  polynomial  in  n,s,l,v,t,  and  tg.  □ 

Definition 12 Let $p \to q$ be a Horn clause. $p' \to q$
is called its antecedent expansion if $p \subseteq p'$ and $p'$
contains only those variables in $p$. A class $C$ of Horn
sentences is closed under antecedent expansion if every
Horn sentence obtained by selecting a subset of its
Horn clauses and replacing them with their antecedent
expansions is also in $C$.

Definition 13 A subsumption algorithm takes a
clause $a \to b$, a conjunction of literals $p$, and a ground
substitution $\theta$ for the variables in $b$, and returns true
if and only if $a\theta \succeq p$.

Theorem 11 PLearn exactly learns a subclass $C$ of
AHk with equivalence, $\succ$, and entailment queries, provided
that (a) $C$ is closed under substitution and antecedent
expansion and (b) the clauses $a \to b$ of the
target concepts in $C$ have a polynomial-time subsumption
algorithm.

Proof.  By  Lemma  5,  each  clause  pi  q,  £  H  in 
PLearn  has  a  target  clause  a  ->  6  and  a  substitution 
9  such  that  aO  C  pi  C  Kob-  Since  the  target  class  is 
closed under substitution and antecedent expansion,
the hypothesis clauses have a polynomial-time subsumption
algorithm. Hence, the forward-chaining step
of computing the consequents of $p$ in line 4 of PLearn
can be done in polynomial time by repeatedly checking
for a hypothesis clause $a \to b$ whose antecedent
subsumes $p$ after a substitution $\theta$ of the variables in $b$,
and adding $b\theta$ to $p$. Hence, by the previous theorem,
PLearn exactly learns $C$. □
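
The forward-chaining step used in this proof can be sketched as follows. This is a minimal illustration, not the PLearn implementation: the representation (literals as tuples, variables as capitalised strings) and the function names are assumptions, and the naive backtracking matcher stands in for the polynomial-time subsumption algorithm that the theorem requires.

```python
# Literals are tuples: (predicate, arg1, ..., argk).
# Terms starting with an upper-case letter are variables (an assumed convention).

def is_var(term):
    return term[:1].isupper()

def match(literal, fact, theta):
    """Extend theta so that literal under theta equals the ground fact, else None."""
    if literal[0] != fact[0] or len(literal) != len(fact):
        return None
    theta = dict(theta)
    for term, value in zip(literal[1:], fact[1:]):
        if is_var(term):
            if theta.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return theta

def substitutions(antecedent, facts, theta=None):
    """Yield every theta such that all antecedent literals, under theta, are in facts."""
    theta = {} if theta is None else theta
    if not antecedent:
        yield theta
        return
    first, rest = antecedent[0], antecedent[1:]
    for fact in facts:
        extended = match(first, fact, theta)
        if extended is not None:
            yield from substitutions(rest, facts, extended)

def forward_chain(hypothesis, p):
    """Return p together with every consequent derivable from the hypothesis clauses."""
    closure = set(p)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in hypothesis:
            # Materialise the matches before adding new facts to the closure.
            for theta in list(substitutions(antecedent, closure)):
                derived = (consequent[0],) + tuple(theta.get(t, t) for t in consequent[1:])
                if derived not in closure:
                    closure.add(derived)
                    changed = True
    return closure
```

Because the clauses are non-generative, every derived literal uses only terms already present in $p$, so the loop terminates; the cost of each pass is dominated by the subsumption test.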

The following definition and theorem identify some
syntactic restrictions on AHk such that the resulting
subclass satisfies the conditions of the previous theorem.

Definition 14 Let $p$ be a set of literals. A Horn
clause $l_1, \ldots, l_n \to q$ is i-determinate w.r.t. $p$
iff there exists an ordering $l_{o_1}, \ldots, l_{o_n}$ of $l_1, \ldots, l_n$
such that for every $i < j \le n$ and every substitution
$\theta$ such that $(l_{o_1}, \ldots, l_{o_{j-1}} \to q)\theta$ is ground and
$\{l_{o_1}, \ldots, l_{o_{j-1}}\}\theta \subseteq p$, there is at most one substitution
$\sigma$ for the variables in $l_{o_j}\theta$ such that $l_{o_j}\theta\sigma$ is ground
and is in $p$.* We call such an ordering of the literals
in the clause an i-determinate ordering w.r.t. $p$. A
Horn program is i-determinate w.r.t. $p$ iff each of the
clauses in the program is i-determinate w.r.t. $p$.

Theorem 12 The class of i-determinate Horn programs
in AHk, denoted as iDetAHk, is exactly learnable
with equivalence, $\succ$, and entailment queries.

Proof. First we show that iDetAHk is closed under
substitution and antecedent expansion. Consider
a target clause $(l_1, \ldots, l_n \to q)$ for a target program in
iDetAHk, whose antecedent literals are sorted in the
determinate order. Let $(l_1, \ldots, l_n, l_{n+1}, \ldots, l_m \to q)\beta$
be the target clause after antecedent expansion and
substitution. We want to show the new clause to be
i-determinate.

For every set of literals $p$, substitution $\theta$, and $j$ such
that $i < j \le m$ and $(l_1, \ldots, l_{j-1})\beta\theta \subseteq p$ is ground,
there is a substitution $\gamma$ which is equivalent to applying
$\beta$ and $\theta$ one after another, so that $(l_1, \ldots, l_{j-1})\beta\theta =
(l_1, \ldots, l_{j-1})\gamma$ and $l_j\beta\theta = l_j\gamma$ for any $l_j$. Since the
target clause satisfies i-determinacy, there must be at
most a single ground substitution $\sigma$ for $l_j\gamma$, $j \le n$, so
that $l_j\gamma\sigma \in p$, which means that this is true for $l_j\beta\theta$
as well. Since the literals from $l_{n+1}$ through $l_m$ do not
have any variables not already in $l_1$ through $l_n$, there is
at most a single ground substitution for them as well.
Hence, $(l_1, \ldots, l_m \to q)\beta$ is also i-determinate.

Now we show that the clauses of the programs in
iDetAHk have a polynomial-time subsumption algorithm.
Given a set of literals $p$ and a clause
$l_1, \ldots, l_n \to q$ (whose literals have an unknown determinate
ordering), consider all possible subsets of
$\{l_1, \ldots, l_n\}$ of size $i$ and less. Note that there are at
most $O(n^i)$ such subsets. For each such subset, instantiate
all the $ki$ variables in that subset in all possible
ways. If the total number of terms in $p$ and $\theta$ is $t$,
this gives us $t^{ki}$ different substitutions. For each such
substitution, there is at most one substitution for the
remaining literals in the clause. The order in which the
remaining literals have to be substituted can be determined
by sequential search: apply the current substitution
to each literal and pick the one that only allows
one possible substitution for its remaining variables.
This can be done in $O(n^2|p|)$ time. If the antecedent
$l_1, \ldots, l_n$ subsumes $p$, then one of the considered subsets
should yield a successful match. Hence, the total
time for the algorithm is bounded by $O(n^i t^{ki} n^2 |p|)$,
which is polynomial in all variables except $k$ and $i$,
which are assumed to be constants.

* This definition strictly generalizes the standard definition
of determinacy (Muggleton & Feng, 1990), in that a
Horn clause (program) is determinate w.r.t. a set of literals
$p$ when it is 0-determinate w.r.t. $p$. i-determinacy
should not be confused with ij-determinacy, or constant-depth
fixed-arity determinacy, which is more restricted
than determinacy.

Since  the  class  iDetAHk  satisfies  the  two  conditions 
required  by  Theorem  11  for  PLearn  to  be  successful, 
the  result  follows.  □ 

4  Discussion  and  Conclusions 

In this paper, we have shown the learnability of certain
subclasses of acyclic k-ary Horn programs. More
specifically, i-determinate Horn programs in AHk are
exactly learnable with equivalence and entailment
queries. Unlike the work of Page (1993) and Arimura
(1997), the programs we considered allow local variables
in the antecedents. However, the clauses must
be non-generative in that the set of terms and variables
that occur in the head of the clause must be a
subset of those that occur in the body of the clause.
This is needed to constrain the forward-chaining inference
step to finish in polynomial time, which could
otherwise become unbounded. It appears that simultaneously
removing both the non-generative and simplicity
restrictions could be difficult when functions are
present, due to the unbounded nature of inference in
that case.

Learning from entailment and learning from interpretations
are two of the standard settings for first-order
learning (De Raedt, 1997). In learning from interpretations,
the learner is given a positive (or negative)
interpretation for which the Horn sentence is
true (or false). Interpretations can be partial in that
the truth values of some ground atoms may be left
unspecified. When membership queries are available,
learning from entailment and learning from interpretations
are equivalent for Horn programs. Hence we
can use PLearn to learn from (negative) interpretations
as follows. Given a negative interpretation, “minimize”
it by removing the negative literals from it
and asking membership queries. Since every negative
interpretation must violate some Horn clause, this
yields an interpretation with a set of positive literals
$l_1, \ldots, l_n$ and at most one negative literal $q_i$. We
can convert this into a positive counterexample for
PLearn: $l_1 \wedge \ldots \wedge l_n \to q_i$. Similarly, if PLearn asks
an entailment membership query on some clause, say,
$l_1 \wedge \ldots \wedge l_n \to q_i$, we can turn that into a membership
query on the interpretation $l_1, \ldots, l_n, \neg q_i$ after substituting
a unique skolem constant for each variable in
the clause. The answer to the entailment query is true
iff the answer to the membership query is false.
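
The two conversions described above can be sketched as follows. This is an illustration under stated assumptions: literals are tuples, capitalised terms are variables, and `membership_oracle` is a hypothetical callable that returns true iff the given partial interpretation (a set of true atoms plus one false atom) is a positive interpretation of the target.

```python
from itertools import count

def is_var(term):
    # Assumed convention: variables start with an upper-case letter.
    return term[:1].isupper()

def skolemize(clause):
    """Replace each variable in the clause by its own fresh skolem constant."""
    antecedent, consequent = clause
    fresh = count()
    mapping = {}

    def ground(literal):
        args = []
        for term in literal[1:]:
            if is_var(term):
                if term not in mapping:
                    mapping[term] = f"sk{next(fresh)}"
                term = mapping[term]
            args.append(term)
        return (literal[0], *args)

    return frozenset(ground(lit) for lit in antecedent), ground(consequent)

def entailment_query(clause, membership_oracle):
    """Answer 'does the target entail the clause?' with one membership query."""
    true_atoms, false_atom = skolemize(clause)
    # The clause is entailed iff the interpretation that makes the antecedent
    # true and the consequent false is NOT a model of the target.
    return not membership_oracle(true_atoms, false_atom)

def clause_from_negative_interpretation(true_atoms, false_atom):
    """Turn a minimised negative interpretation into a positive counterexample."""
    return (tuple(sorted(true_atoms)), false_atom)
```

The minimisation step itself (dropping negative literals while the interpretation stays negative) is omitted here, since it only repeats the membership query.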

One limitation of our algorithm is that it assumes that
the supported-by relation, $\succ$, is given. While this is
a reasonable assumption in some planning domains,
where it is known which goals occur as subgoals of
which, it is desirable to learn this relation. Unfortunately,
this seems difficult due to a number of problems.
One of the main difficulties is that it is sometimes
not possible to determine which of the set of
consequents of an antecedent is the prime consequent.
For  example,  consider  the  target  E  :  {li{x)  A  hix) 
h{x)\  h{x)  A  Isix)  /4(a;)}.  Given  the  counterexam¬ 
ple  li{c)  A  hie)  hie),  the  literal  hie)  is  not  a  cor¬ 
rect  consequent,  but  hie)  is.  Although  Lemma  3  says 
that the prime consequent is the right consequent to choose,
without knowing the order it is not clear how to identify
it. Learning all possible clauses while maintaining
all consequents also does not seem to work, since it results
in spurious matches between some of these redundant
clauses and counterexamples in some cases.

As shown in (Reddy & Tadepalli, 1997b), Horn programs
can be used to express goal-decomposition rules
(d-rules) for planning using the situation-calculus formalism.
We believe that the algorithm discussed here
and its extensions can be applied to learn d-rules,
which is an important problem in speedup learning.
D-rules are a special case of hierarchical task networks
or HTNs (Erol, Hendler, & Nau, 1994), in
that HTNs allow partial ordering over subgoals and
non-codesignation constraints over variables whereas
d-rules do not. Nevertheless, it can be shown that
HTNs can be expressed as Horn programs.

Abstract

This  paper  introduces  the  RL-TOPs  archi¬ 
tecture  for  robot  learning,  a  hybrid  system 
combining  teleo-reactive  planning  and  rein¬ 
forcement  learning  techniques.  The  aim  of 
this  system  is  to  speed  up  learning  by  de¬ 
composing  complex  tasks  into  hierarchies  of 
simple  behaviours  which  can  be  learnt  more 
easily.  Behaviours  learnt  in  this  way  can 
subsequently  be  re-used  to  solve  a  variety  of 
problems,  reducing  the  need  to  learn  every 
new  task  from  scratch.  It  is  even  possible 
to  learn  multiple  behaviours  simultaneously, 
thus  making  more  efficient  use  of  experience. 

We  demonstrate  these  advantages  in  a  simple 
simulated  environment. 


1  INTRODUCTION 

Programming  robots  is  difficult  (Dorigo,  1996).  Of¬ 
ten  the  best  way  for  the  robot  to  solve  a  problem  is 
unknown,  or  hard  to  express.  The  real  world  is  dy¬ 
namic,  and  to  be  truly  autonomous,  robots  need  to  be 
able  to  cope  with  a  changing  environment  (Covigaru  & 
Lindsay,  1991).  Robot  programming  would  be  greatly 
simplified  if  robots  were  able  to  learn  appropriate  be¬ 
haviours  of  their  own  accord,  and  could  adapt  those 
behaviours  to  changes  in  the  world  around  them.  Re¬ 
inforcement  Learning  (RL)  provides  an  elegant  theo¬ 
retical  framework  to  achieve  these  goals  but  often  fails 
in  practice  due  to  the  “curse  of  dimensionality”  oper¬ 
ating  in  large  state  spaces  and  with  complex  problems 
such  as  those  typically  found  in  real  robot  domains.  As 
the number of states grows, the problem of determining
the best action to perform in each state becomes
impossibly difficult.

This problem is not peculiar to RL; traditional robot
programmers have faced it also. It is generally not
feasible to produce a single monolithic control system
which handles all possibilities. Instead, the trend has
been towards behaviour-based programming (Mataric,
1996). A complex task is decomposed into a set of
simple modules or behaviours, each of which handles
a small part of the problem. These are more easily
programmed, and can then be combined to solve the
full problem.

One such technique, Brooks's subsumption architecture
(Brooks, 1986), has been successfully transferred to
the RL domain to simplify learning. Mahadevan and
Connell (1992) showed that
a complex learning task (robot box-pushing), which
could not be learnt by a simple reinforcement learner,
could be learnt by decomposing it into a
subsumption-style hierarchy of simple behaviours and
learning each of these behaviours as a distinct reinforcement
learning task. Thus the robot effectively had
several separate learning modules, each of which worked
independently to learn a sub-part of the task, but
which could all cooperate to provide the overall
solution to the problem.

Task decomposition of this kind is well recognised as
a way to improve learning rates. As each module only
has to learn its behaviour on a small subset of possible
states, its search space is reduced, and so it can find
the optimal policy more quickly. Other algorithms
based on this realisation include
Kaelbling's HDG (Kaelbling, 1993), Dayan and Hinton's
Feudal Reinforcement (Dayan & Hinton, 1992)
and Dietterich's MAXQ (Dietterich, 1997).
These algorithms differ from Mahadevan and Connell's
in that they are based on more geometrical decompositions
of the world, rather than using specific domain
knowledge to define the behaviours. Because of
this, they appear to be less applicable to problems in
robotics, which involve high-dimensional state information
from a variety of sensing apparatus, without
a simple uniform geometry.

The  advantages  of  the  subsumption-architecture,  how¬ 
ever,  are  offset  by  the  rigidity  of  the  representation 
used.  The  hierarchy  has  to  be  designed  by  hand  by 
the  programmer,  which  can  be  a  non-trivial  task  for 
many  problems.  What  is  more,  a  new  task  requires  a 
new  set  of  behaviours  and  a  new  hierarchy.  Is  it  possi¬ 
ble  to  design  a  more  flexible  system  that  can  automat¬ 
ically  build  behaviour  hierarchies  to  solve  particular 
problems?  Can  behaviours  learnt  to  solve  one  task  be 
re-used  to  accelerate  the  learning  of  others?  These  are 
the  questions  that  this  paper  seeks  to  address. 

2  TELEO-REACTIVE  PLANNING 

This  problem  of  selecting  and  ordering  an  appropri¬ 
ate  set  of  predefined  behaviours  to  achieve  a  cer¬ 
tain  goal  has  traditionally  been  the  domain  of  plan¬ 
ning  algorithms.  Historically,  planning  systems  have 
been  deemed  unsuitable  for  robot  control,  because 
they  failed  to  model  the  complexity  of  the  real  world. 
Plans  were  based  on  sequences  of  instantaneous  ac¬ 
tions,  which  were  expected  to  succeed  every  time;  but 
in  the  real  world  actions  take  time  to  perform,  and 
are not always reliable. However, modern planning algorithms
are now able to produce plans which closely
resemble the behaviour-based architectures of Brooks
and others. Plans can now include durative actions,
which operate over a period of time. Execution of
plans is reactive (i.e. the state of the world is constantly
re-evaluated to determine which action to perform),
and universal (i.e. contingencies exist for all
situations).

One such planner is Nilsson's Teleo-Reactive (TR)
planning system (Nilsson, 1994). It is based around
the notion of a teleo-operator (or TOP), which is a
means of describing a durative action in terms of its
conditions and effects. A TOP consists of an action a,
a pre-image π and a post-condition λ. The pre-image
and post-condition are conjunctions of predicates from
the planner's state description language. The action
may be a simple primitive action, or may be a complex
behaviour in its own right. The TOP a : π → λ
signifies that if a is executed while π is true, then λ
will eventually become true. Until such time as λ is
achieved, π is maintained.¹

Teleo-reactive plans are represented as structures
called TR-Trees. Nodes in TR-Trees represent state
descriptions, with the root node as the goal. Connections
between nodes are labelled with actions, indicating
that if the action shown is executed in the lower
node, then the condition of the upper node will eventually
be achieved.

TR-trees are executed reactively. The nodes in the
tree are continually re-evaluated and the action corresponding
to the shallowest true node is executed. If
at any time there is no true node in the tree, then the
planner can be reactivated to grow the plan to cover
the new situation; thus TR-trees represent (near-)universal
plans.
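
The reactive execution rule can be sketched as a breadth-first search for the shallowest node whose condition holds. This is a minimal illustration, not Nilsson's implementation: the `TRNode` class, the dictionary-based state and the example action names are all assumptions made for the sketch.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class TRNode:
    condition: Callable[[dict], bool]     # state description of this node
    action: Optional[str] = None          # action on the edge to the parent; None at the root
    children: List["TRNode"] = field(default_factory=list)

def select_node(root, state):
    """Return the shallowest node whose condition is true in state, or None."""
    queue = deque([root])
    while queue:
        node = queue.popleft()            # breadth-first, so shallower nodes come first
        if node.condition(state):
            return node
        queue.extend(node.children)
    return None                           # no true node: the planner must extend the tree
```

A caller would re-run `select_node` at every time step: a result of `None` triggers replanning, the root itself means the goal holds, and any other node supplies the action to execute.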

3 REINFORCEMENT LEARNT BEHAVIOURS AND TELEO-OPERATORS

Like  TOPs,  behaviours  acquired  by  reinforcement 
learning  are  also  durative  actions  with  a  pre-image  (ap¬ 
plication  space)  and  a  post-condition  (goal).  Given  a 
suitable  language  to  describe  these  attributes,  a  set  of 
reinforcement  learnt  behaviours  can  easily  be  repre¬ 
sented  as  a  list  of  TOPs.  A  TR-planner  could  then  be 
used  to  combine  these  behaviours  automatically  into  a 
hierarchy  to  solve  a  given  problem,  removing  the  need 
for  the  programmer  to  do  this  by  hand. 

Furthermore,  the  same  TOP  descriptions  can  also  be 
used  at  the  lower  level  as  reinforcement  schema  for  the 
learning  algorithm:  The  post-condition,  if  achieved, 
indicates  success,  which  should  be  rewarded.  Prema¬ 
turely  quitting  the  pre-image  indicates  failure,  which 
carries  a  punishment.  Thus  the  one  description  has 
two  functions:  it  is  used  at  the  high  level  to  tell  the 
planner  how  to  use  the  behaviour,  and  at  the  low  level 
to  tell  the  learner  what  it  is  trying  to  learn.  This  du¬ 
ality  is  the  basis  of  the  Reinforcement  Learnt  TOPs 
(RL-TOPs)  system. 

4  THE  RL-TOPS  ARCHITECTURE 

The RL-TOPs architecture is a combination of a simple
goal-regression TR-planner and the discounted-reward
reinforcement learning algorithm C-Trace²
(Pendrith & Ryan, 1996). An outline is shown in Figure 1.

Figure 1: The RL-TOPs architecture.

¹ A TOP may also have side-effects which are not part of
its post-condition, but these are not relevant to the current
discussion.

Based  on  the  domain  and  the  problem  to  be  solved, 
the  user  provides  five  things: 

•  A  Low-level  State  representation,  based  on  the 
robot’s  sensors, 

•  A  set  of  primitive  Actions,  based  on  the  available 
actuators, 

•  A  High-level  State  description  language  (which 
includes  whatever  features  of  the  state  space  are 
likely  to  be  relevant  to  the  planner,  including  the 
goal), 

•  A  Goal  description, 

•  A  set  of  behaviour  descriptions  of  the  form 
(Name,  Pre-image,  Post-condition),  which  form 
the  RL-TOP  Library. 

The first thing the system does is to supplement each
of these RL-TOPs with its own Q-Module. This contains
all the information required by the reinforcement
learning algorithm to represent the behaviour. The
primary component is the utility (or Q) function, but
there may be other components depending on the algorithm.
Unless previously saved behaviours are being
re-used, the Q-function is initialised to be zero everywhere.

² The actual reinforcement learning algorithm used is
not important, except insofar as it must support learning
from both successful and unsuccessful trials. This includes
most common RL algorithms such as Q-Learning (Watkins,
1989) and SARSA(λ) (Singh & Sutton, 1996).

Now,  given  the  goal  definition  and  the  library  of  be¬ 
haviours  available  to  it,  the  Planner  constructs  a  plan 
in  the  form  of  a  TR-Tree.  The  Planner  only  constructs 
as  much  of  the  tree  as  is  necessary  at  any  time.  Ini¬ 
tially  the  tree  consists  of  just  the  goal  node.  As  the 
agent  encounters  situations  which  aren’t  covered  by 
the  plan,  the  Planner  will  add  new  nodes  to  the  tree 
to  cover  these  states,  and  will  add  appropriate  actions 
to  the  plan  to  link  them  in  to  the  tree. 

The  plan  is  passed  to  the  Plan  Executor,  which 
also  reads  the  current  high-level  state  description,  and 
chooses  which  TOP  to  execute.  If  the  plan  does  not 
cover  the  state,  then  the  Executor  re-calls  the  Planner. 
Otherwise,  the  selected  TOP  is  passed  to  the  TOP  Ex¬ 
ecutor. 

The TOP Executor takes the active RL-TOP and
the current low-level state, and decides which low-level
action to execute. Typically, this will be the policy
action provided by the TOP's Q-Module, but an occasional
exploratory action may also be performed. For
the experiments detailed in this paper, the ε-greedy
exploration algorithm (Thrun, 1992) was used, with
ε = 0.1 (i.e. at each step a random exploratory action
is chosen with probability 1 in 10).

Figure 2: The gridworld domain.
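
As an illustration of the exploration rule, ε-greedy selection over one state's Q-values might look like the sketch below. The function name and the dictionary representation of the Q-values are assumptions for this example, not details taken from the RL-TOPs implementation.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Pick an action from a dict mapping action -> Q-value for the current state."""
    if rng.random() < epsilon:
        # Exploratory step: choose uniformly among all actions.
        return rng.choice(list(q_values))
    # Greedy step: choose a best-valued action, breaking ties at random.
    best = max(q_values.values())
    return rng.choice([a for a, q in q_values.items() if q == best])
```

Random tie-breaking matters early on, when the Q-function is still zero everywhere and every action looks equally good.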

The result of the executed action, in terms of changes
in the high-level state description, is used by the Reinforcement
Schema to determine the reinforcement
feedback, r, to provide to the Learner. This unit determines
whether, in terms of its pre-image and
post-condition, the RL-TOP has succeeded or failed.
If the post-condition has become true, then the TOP
has succeeded, and a reward of r = +1 is returned.
Otherwise, if the pre-image is no longer true, then the
TOP has failed (by exiting its application space prematurely),
and a punishment of r = -1 is returned. If
neither of these is the case, then r = 0.
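
The reinforcement schema is small enough to state directly in code. In this sketch the pre-image and post-condition are passed as predicates over a high-level state; that representation, and the function name, are assumptions made for illustration.

```python
def reinforcement(pre_image, post_condition, state):
    """Return the feedback r for an RL-TOP given the new high-level state."""
    if post_condition(state):
        return 1.0       # the post-condition became true: success
    if not pre_image(state):
        return -1.0      # left the application space prematurely: failure
    return 0.0           # still inside the pre-image, nothing decided yet
```

The order of the tests matters: achieving the post-condition usually also means leaving the pre-image, so success has to be checked first.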

Combining the low-level state and action information,
and the reinforcement signal provided by the Reinforcement
Schema, the Learner then performs the
appropriate update on the RL-TOP's Q-Module, according
to whatever reinforcement learning algorithm
is used. Then the process repeats, with the Plan Executor
deciding which TOP to execute for the next
time step, until the goal is achieved.
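
To make the Learner's update concrete, the sketch below uses one-step tabular Q-learning in place of C-Trace, which is not reproduced here; Q-learning is one of the algorithms the paper names as a suitable substitute. The function name, the table layout and the `terminal` flag are assumptions, while the default learning rate 0.1 and discount factor 0.9 follow the values reported for the experiments.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, terminal,
             alpha=0.1, gamma=0.9):
    """One-step Q-learning update on a table q keyed by (state, action)."""
    if terminal:
        # The RL-TOP succeeded or failed, so there is no future value to add.
        target = reward
    else:
        target = reward + gamma * max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (target - q[(state, action)])

# A Q-Module initialised to zero everywhere, as described for new behaviours.
q_module = defaultdict(float)
```

Each RL-TOP would own one such table, and only the table of the currently active TOP is updated at each step.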

5  EXPERIMENTAL  DOMAIN 

Experimental  work  is  currently  under  way  to  demon¬ 
strate  the  RL-TOPs  architecture  on  an  insectoid  robot 
called  Prometheus,  aiming  to  get  the  robot  to  learn 
how  to  walk  towards  a  beacon.  Results  from  this  plat¬ 
form  are  not  yet  available,  so  a  simple  simulated  do¬ 
main  was  constructed  to  demonstrate  the  system. 

The simulation consists of an agent in a 30 × 21 gridworld,
as shown in Figure 2. At the low level, the
agent can sense its position within the world (as an
xy-coordinate) and has four actions available to it, to
move north, south, east or west. Each action is guaranteed
to succeed unless there is a wall in the way.

Table 1: RL-TOPs used for gridworld experiments.

The world is divided into five rooms, labelled 0 through
4, and the agent's goal is to reach a particular one
from a randomly chosen starting position. The high-level
state and action descriptions are all in terms of
which room the agent occupies, given by the predicate
room(R).

For  each  of  the  experiments  following,  the  agent  was 
allowed  to  run  for  400  trials,  each  starting  at  a  random 
location  in  the  world  and  finishing  when  the  goal  is 
achieved.  The  length  of  each  trial,  in  terms  of  the  total 
number  of  low-level  actions  performed,  was  recorded. 
Twenty  such  runs  were  performed  for  each  algorithm 
presented,  and  the  results  are  the  average  trial  lengths 
over  these  twenty  runs. 

The  measurement  we  are  interested  in  comparing  is 
the  time  taken  to  learn  the  task,  that  is,  the  number 
of  primitive  actions  performed  before  the  agent  con¬ 
verged  to  an  optimal  (or  near-optimal)  policy.  To  this 
end,  the  graphs  compare  cumulative  trial  lengths  for 
each  experiment.  The  cumulative  trial  length  is  the 
sum of the lengths of all trials up to and including the
current one.
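
For clarity, the cumulative trial length is a running sum over the recorded trial lengths, as in this short sketch (the function name is illustrative only):

```python
from itertools import accumulate

def cumulative_trial_lengths(trial_lengths):
    """Running total of low-level actions performed up to and including each trial."""
    return list(accumulate(trial_lengths))
```

Once the policy has converged, each further trial adds roughly the same number of steps, so the cumulative curve becomes close to a straight line.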

5.1 EXPERIMENT 1: MODULAR VS. MONOLITHIC

The first experiment demonstrates the improvement
in performance of the modular RL-TOPs architecture
over a simple monolithic reinforcement learner.
The agent's goal is to reach room 4. The monolithic
learner has a single Q-Module which covers the entire
state space, whereas the modular learner has been provided
with eight RL-TOP descriptions, corresponding
to movement from one room to an adjoining one, as
listed in Table 1. The TR-Tree produced by the planner
is shown in Figure 3.

Graph 1: Learning times for gridworld task using (1a)
Monolithic learner, (1b) RL-TOPs, (2) RL-TOPs re-using
previously learnt behaviours, (3) RL-TOPs using
behaviours learnt with concurrent learning.

The C-Trace learning algorithm was used in both
cases, with the learning rate set to 0.1 and the discount
factor γ = 0.9. The monolithic learner was rewarded
on success only, with a reinforcement value of 1. As
with the RL-TOPs algorithm, the monolithic learner
used the ε-greedy exploration algorithm, with ε = 0.1.

Graph  1  shows  the  results  of  the  two  experiments. 
Both  approaches  converged  to  a  nearly  optimal  pol¬ 
icy  within  about  200  trials,  but  the  monolithic  learner 
took  about  40,000  more  steps  to  reach  this  point.  A 
large part of this difference is established in the first 20
trials, which took the monolithic and modular systems
25,372 and 11,253 steps respectively. This demonstrates
the important difference between the two. In
the  early  stages  of  learning,  when  the  Q-function  is  still 
mostly  zero,  the  only  actions  that  provide  any  informa¬ 
tion  are  those  that  provide  non-zero  feedback.  Since, 
in  the  monolithic  case,  rewards  are  few,  the  agent  has 
nothing  to  direct  it,  and  a  large  amount  of  time  is 
spent  aimlessly  exploring  the  world,  without  learning 
anything. 

In  the  modular  system,  however,  the  application 
spaces  for  individual  behaviours  are  smaller,  so  the 
rewards  (and  penalties)  are  closer  at  hand.  Thus  ran¬ 
dom  exploration  is  more  likely  to  result  in  useful  in¬ 
formation  more  quickly,  and  learning  is  significantly 
faster. 


5.2  EXPERIMENT  2:  RE-USING 
BEHAVIOURS 

Another  advantage  of  the  modular  system  over  the 
monolithic  is  that  the  individual  behaviours  learnt  in 
the  modular  trials  can  be  re-used  in  a  way  that  the 
monolithic  policy  cannot.  In  the  next  experiment, 
the  same  RL-TOPs  from  the  previous  experiment  were 
used,  with  the  Q-Modules  saved  from  each  run,  in  or¬ 
der  to  solve  a  new  problem. 

The goal is now to reach room 3. The new plan is
shown in Figure 4. Notice that it includes two of
the behaviours learnt in the previous experiment, go02
and go12. The other two behaviours, go42 and go23,
have not been used before and still need to be learnt.

From  the  graph,  we  can  see  that  a  significant  amount 
of  time  is  saved  in  learning  to  perform  this  new  task, 
compared  to  the  previous  one,  which  did  not  have  the 
benefit of pre-existing behaviours. The reason for this
is obvious: the agent does not need to waste time
re-learning the go02 and go12 behaviours.

Still,  a  significant  amount  of  time  was  taken  up  with 
learning  the  go23  behaviour  which  would  appear  to  be 
redundant.  Although  the  agent  has  never  performed 
this  behaviour  before,  it  has  nevertheless  spent  a  lot 
of  time  in  room  2  in  the  previous  experiment,  albeit 
while  executing  a  different  behaviour.  Common  sense 
suggests  that  this  prior  experience  should  be  of  some 
use  in  learning  the  new  behaviour  more  quickly.  Is 
it  possible  to  make  use  of  information  gathered  while 
executing one behaviour in order to learn another? We
address this question in the next section of this paper.

Figure 4: The TR-Tree for going to room 3.

6  CONCURRENT  LEARNING: 
MAKING  BETTER  USE  OF 
EXPERIENCE 

At this point, one under-appreciated feature of certain
RL algorithms comes to our aid. Algorithms such
as Q-Learning and C-Trace (but not SARSA) are
off-policy learners, which means that the sequence of
actions presented to the learner does not have to correspond
to an actual execution of the policy (Sutton & Barto,
1998). It is even possible to learn one behaviour while
executing a quite different one, so long as their
application spaces overlap.

This technique, called concurrent learning, can be
added to the RL-TOPs architecture by a simple
modification to the Learner module. Rather than just
updating the Q-Module of the currently active TOP, the
Learner  examines  the  RL-TOP  Library  and  selects  all 
the  behaviours  which  are  eligible  to  be  updated.  This 
includes  any  behaviour  the  pre-image  of  which  was 
satisfied  before  the  most  recent  action  was  performed. 
Thus,  to  use  the  simulation  above  as  an  example,  if  the 
agent  executes  some  action  in  room  2,  then,  regardless 
of  the  result  of  the  action,  all  those  behaviours  which 
have  room (2)  as  their  pre-image,  will  be  eligible  to  be 
updated. 

The  Learner  then  consults  the  Reinforcement  Schema 
for  each  behaviour  separately,  to  find  out  the  reinforce¬ 
ment  value  for  that  particular  TOP.  For  some,  the  ac¬ 
tion  just  executed  may  comprise  success,  for  others 
failure,  and  for  others  neither  of  the  two.  The  Learner 
uses the reinforcement value for each TOP to update
that TOP’s Q-Module. Then execution proceeds as
usual.
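
The concurrent update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Behaviour` class, the `concurrent_update` function, the reward values (+1 for success, -1 for failure, 0 otherwise), the behaviour name `go24`, and the use of a one-step Q-learning backup in place of C-Trace are all hypothetical choices made for the example.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount factor from the experiments

class Behaviour:
    """One RL-TOP: a pre-image, a goal, and its own Q-Module (here a table)."""
    def __init__(self, name, pre_image, goal):
        self.name = name
        self.pre_image = pre_image  # set of rooms in which the behaviour applies
        self.goal = goal            # room that counts as success
        self.q = defaultdict(float)

    def reinforcement(self, next_room):
        """Hypothetical reinforcement schema: returns (reward, episode_over)."""
        if next_room == self.goal:
            return 1.0, True        # success for this TOP
        if next_room not in self.pre_image:
            return -1.0, True       # failure: left the application space
        return 0.0, False           # neither success nor failure

def concurrent_update(library, state, action, next_state, actions):
    """Update every behaviour whose pre-image held before the action."""
    room, next_room = state[0], next_state[0]
    updated = []
    for b in library:
        if room not in b.pre_image:
            continue                # not eligible for this experience
        reward, done = b.reinforcement(next_room)
        future = 0.0 if done else max(b.q[(next_state, a)] for a in actions)
        old = b.q[(state, action)]
        b.q[(state, action)] = old + ALPHA * (reward + GAMMA * future - old)
        updated.append(b.name)
    return updated

# A single step from room 2 into room 3 updates both room-2 behaviours.
moves = ["north", "south", "east", "west"]
library = [Behaviour("go23", {2}, 3), Behaviour("go24", {2}, 4),
           Behaviour("go02", {0}, 2)]
updated = concurrent_update(library, (2, "door"), "east", (3, "entry"), moves)
```

The same transition counts as success for one behaviour and failure for another, so a single step of experience yields two distinct Q-Module updates.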

This technique should significantly speed up learning
more than one task, because it makes more effective
use of experience gained.

6.1  EXPERIMENT  3:  CONCURRENT 
LEARNING 

To  demonstrate  the  benefit  of  concurrent  learning  the 
two  previous  experiments  were  repeated,  but  this  time 
with  all  eligible  behaviours  being  learnt  concurrently. 
First  the  agent  did  400  trials  with  room  4  as  its  goal. 
Then,  using  the  same  learnt  behaviours,  the  goal  was 
changed  to  room  3.  Graph  1  shows  the  results  of  this 
run.  Compare  these  to  the  results  of  experiment  2, 
which  had  the  same  goal,  but  did  not  use  concurrent 
learning. The concurrent system converged in very little
time at all. The behaviour go23 was almost completely
optimised before it was even run. The only behaviour
to be learnt was go42, because the agent had
had no prior experience with performing any actions
in room 4.

7  RELATED  WORK 

In  addition  to  those  already  mentioned,  other  hi¬ 
erarchical  learning/planning  systems  of  note  include 
Singh’s  Compositional  Q-Learning  system  (Singh, 
1992),  which  learns  a  Q-function  for  a  complex  prob¬ 
lem  by  constructing  a  gating  module,  which  selects  an 
appropriate lower-level behaviour at each step; and the
work of Precup et al. (Precup, Sutton, & Singh, 1997),
which extends standard dynamic programming techniques
to be able to use macro actions (behaviours)
as  well  as  primitive  actions  in  their  policies.  Both 
of  these  systems  assume  that  the  behaviours  that  are 
used  are  already  fully  specified,  perhaps  by  earlier 
learning  runs. 

Benson  has  produced  a  system  that  is  complementary 
to  that  presented  here.  His  TRAIL  (Benson,  1996)  ar¬ 
chitecture  takes  an  existing  set  of  actions  or  behaviours 
and,  by  guided  experiments,  learns  appropriate  TOP 
descriptions. It may be possible to combine that work
and this, to produce a system in which learnt information
goes in both directions, refining both the behaviours
and the model.

8  CONCLUSION 

As has been demonstrated, modular decomposition is
an effective way to improve the speed of reinforcement
learning algorithms. The Reinforcement Learnt
Teleo-operators (RL-TOPs) architecture, combining
low-level reinforcement learning with high-level symbolic
planning, is an elegant and effective way of expressing
this decomposition. The system allows the
automatic construction of appropriate hierarchies of
learnt behaviours to solve a given problem, and provides
a means of re-using behaviours learnt in one task
for solving another. With the addition of concurrent
learning of multiple behaviours, this can greatly improve
learning times over a variety of problems.

A  limitation  of  this  system  is  that  the  policy  learnt 
is  sub-optimal  because  the  agent  cannot  “cut  corners” 
between  behaviours.  Work  is  in  progress  to  find  a  way 
to  allow  the  agent  to  benefit  from  the  domain  informa¬ 
tion  given  by  the  task  decomposition,  while  still  being 
able  to  converge  eventually  to  an  optimal  policy. 

Another  avenue  for  future  research  would  be  to  investi¬ 
gate  the  question  of  what  to  do  when  the  programmer- 
specified  TOPs  are  insufficient  to  find  a  path  to  the 
goal. Possibly the system could be extended so as to
postulate its own new behaviours in this state. However,
this is likely to be a very difficult problem.

Abstract

To evolve structured programs we introduce
H-PIPE, a hierarchical extension of
Probabilistic Incremental Program Evolution
(PIPE). Structure is induced by “hierarchical
instructions” (HIs) limited to top-level,
structuring program parts. “Skip nodes”
(SNs) allow for switching program parts on
and off. They facilitate synthesis of certain
structured programs. In our experiments
H-PIPE outperforms PIPE: structural bias can
speed up program synthesis.

Keywords:  Probabilistic  Incremental  Program  Evo¬ 
lution,  Structured  Programs,  Hierarchical  Programs, 
Non-Coding  Segments. 

1  Introduction 

Overview.  Automatic  program  synthesis  is  of  in¬ 
terest  because  it  addresses  the  problem  of  searching 
in  general  algorithm  space  as  opposed  to  more  lim¬ 
ited  search  spaces  like  those  of,  say,  feedforward  neu¬ 
ral  networks.  Hierarchical  Probabilistic  Incremental 
Program  Evolution  (H-PIPE)  is  a  novel  method  for 
synthesizing  structured  programs.  It  uses  the  PIPE 
paradigm  (Salustowicz  and  Schmidhuber,  1997)  to  it¬ 
eratively  generate  successive  populations  of  functional 
programs  from  an  adaptive  probability  distribution 
over  all  possible  programs  constructible  from  a  prede¬ 
fined  instruction  set.  As  in  PIPE  the  probability  dis¬ 
tribution  is  adapted  in  three  ways:  (1)  Each  iteration 
the  probability  of  the  best  program  in  the  current  pop¬ 
ulation  is  increased;  (2)  occasionally  the  probability  of 
the  best  program  found  so  far  (elitist)  is  increased;  (3) 
sometimes probabilities are mutated to better explore
the search space. H-PIPE uses “hierarchical instructions”
(HIs) and “skip nodes” (SNs). HIs can be used
to combine lower-level program parts, thus inducing
structure. SNs function as gates that allow for keeping
program parts dormant without losing them in the
course of evolution. In combination with HIs they also
enable H-PIPE to substitute program parts by superior
partial solutions discovered at later evolutionary
stages.

Structure.  Early  genetic  programming  (GP)  work 
(Dickmanns  et  al.,  1987)  as  well  as  Adaptive  Levin 
Search  (Schmidhuber,  1997,  Schmidhuber  et  al., 
1997b)  allow  for  powerful  programs  with  arbitrary 
loops  etc.  Sometimes,  however,  it  is  beneficial  to  in¬ 
troduce  inductive  bias  by  appropriately  constraining 
the  search  space  of  possible  programs.  Except  for 
programs  evolved  by  tree-based  GP  (Cramer,  1985; 
Koza,  1992),  however,  not  much  work  has  been  done 
on  evolution  of  programs  with  significant  structural 
constraints.  There  are  two  such  GP  variants. 

The  first  reuses  program  parts,  usually  in  a  way  less 
general  than  that  achievable  through  arbitrary  jumps 
(Dickmanns  et  al.,  1987).  Typically  subprograms  are 
generated  and/or  extracted  from  evolved  programs; 
they  may  then  be  called  in  a  usually  non-recursive 
fashion  from  different  positions  in  the  code.  Exam¬ 
ples  are:  “automatically  defined  functions”  and  encap¬ 
sulation  (Koza,  1992),  module  acquisition  (Angeline 
and Pollack, 1992), adaptive representations through
learning (Rosca and Ballard, 1996), automatically defined
macros (Spector, 1996). Other approaches do not
generate  or  extract  subprograms  but  restrict  GP’s  re¬ 
combination  operator  such  that  it  cannot  destroy  cer¬ 
tain  program  parts  to  be  reused  in  the  future  (e.g., 
Langdon,  1995;  Pringle,  1995;  Zannoni  and  Reynolds, 
1997). 

The second variant uses grammars to induce structure,
constrain the search space, and provide initial
bias to speed up evolution. Examples are context-free
(Whigham, 1995, Gruau, 1996) or logic grammars
(Wong and Leung, 1996).

Hierarchical Instructions. H-PIPE’s programs are
composed of instructions from a fixed instruction set
$S = \{I_1, I_2, \ldots, I_z\}$. Each node of the code tree
contains an instruction $I$ and can have several son nodes
whose instructions are viewed as arguments of $I$.
Programs with hierarchical instructions (HIs) are special
cases of programs constrained by context-free grammars:
We partition $S$ into $m$ disjoint, non-empty instruction
sets $S^0, S^1, \ldots, S^m$, and ensure that all “terminal
instructions” - instructions with zero arguments
- are in $S^0$. Hierarchical order is imposed as follows:
Each argument of an instruction in $S^v$ is in $S^v$ or in the
“lower level” set $S^{v-1}$. At least one argument must be
in $S^{v-1}$, except when $v = 0$. Higher-level instructions
can be used to combine program parts made out of
lower-level instructions, thus inducing structure.

Non-Coding Program Parts. Non-coding program
parts (“introns”) are those that do not affect the results
the program calculates. E.g., in $f(x) = x * 1$, the
“$* 1$” part is non-coding. Most previous work on
non-coding program parts focuses on genetic program
synthesis (Blickle and Thiele, 1994, McPhee and Miller,
1995,  Nordin  et  al.,  1996,  Haynes,  1996,  Wineberg  and 
Oppacher,  1996).  Usually  non-coding  program  parts 
evolve  or  can  be  inserted  to  protect  coding  program 
parts  (parts  that  do  affect  results  calculated  by  the 
program)  from  destructive  genetic  recombination  op¬ 
erators  (Blickle  and  Thiele,  1994,  McPhee  and  Miller, 
1995,  Nordin  et  al.,  1996,  Haynes,  1996).  Blickle  and 
Thiele  (1994),  as  well  as  McPhee  and  Miller  (1995), 
however,  point  out  that  large  blocks  of  non-coding  seg¬ 
ments  in  tree-based  GP  programs  cause  very  slow  con¬ 
vergence  and  difficulties  in  escaping  from  local  minima. 
Haynes  (1996),  on  the  other  hand,  shows  that  artificial 
removal  of  non-coding  segments  from  those  programs 
leads  to  premature  convergence.  Nordin,  Francone, 
and  Banzhaf  (1996)  investigate  the  role  of  non-coding 
segments  in  a  GP  approach  based  on  variable-length 
strings.  They  note  that  non-coding  segments  may  play 
an  important  role  in  finding  good  solutions  and  speed¬ 
ing  up  convergence.  Wineberg  and  Oppacher  (1996) 
use fixed-length strings and find that non-coding segments
reduce the search space and speed up evolution.

General  observation.  The  literature  above  suggests: 
in  tree-based  GP  programs  with  little  structure,  the 
effect of non-coding segments is twofold. On the one
hand they seem necessary to protect blocks of coding
segments; on the other hand they can hinder discovery
of  acceptable  solutions.  In  the  case  of  structured  pro¬ 
grams,  however,  non-coding  program  parts  can  both 
speed  up  convergence  and  aid  in  finding  good  solu¬ 
tions.  Loosely  speaking,  the  more  structured  the  pro¬ 
grams  (e.g.,  the  greater  the  restrictions  on  the  cod¬ 
ing  strings),  the  higher  the  potential  significance  of 
non-coding  segments.  Our  own  experiments  with  skip 
nodes  will  add  more  empirical  evidence  in  this  direc¬ 
tion. 

Skip  Nodes  (SNs).  Much  like  certain  “jump”  in¬ 
structions,  skip  nodes  (SNs)  are  instructions  that  al¬ 
low  for  skipping  program  parts.  In  the  context  of  tree- 
based  functional  programs,  SNs  are  functions  with  n 
arguments,  where  n  denotes  the  maximal  number  of 
arguments  of  functions  in  S.  SNs  return  exactly  one 
of  their  arguments  and  ignore  the  others,  which  thus 
represent  non-coding  program  parts  if  n  >  1.  We  will 
demonstrate  the  benefits  of  SNs  in  structuring  parts 
of  H-PIPE  programs. 

Outline. Section 2 describes the H-PIPE approach.
Section 3 compares the use of HIs and SNs to standard
PIPE on function regression and 6-bit parity. Section
4 concludes.

2  Hierarchical  PIPE 

Overview. We will describe H-PIPE, a hierarchical
extension of PIPE (Salustowicz and Schmidhuber,
1997). Like PIPE, H-PIPE combines probability
vector coding of program instructions (Schmidhuber
et al., 1997a, 1997b), Population-Based Incremental
Learning (PBIL - Baluja & Caruana, 1995), and
tree-coded programs like those used in variants of GP.
Unlike PIPE, H-PIPE uses HIs to evolve structured
programs and SNs to facilitate this process. We will first
describe HIs and then SNs.

2.1  Hierarchical  Instructions  (HIs)

Program Instructions. H-PIPE’s programs are
composed from $z$ instructions in the instruction set
$S = \{I_1, I_2, \ldots, I_z\}$. Each instruction $I_j$ ($1 \le j \le z$)
is either a function or a terminal. Functions and terminals
differ in that the former have one or more arguments
and the latter have zero. Thus $S = F \cup T$,
where $F = \{f_1, f_2, \ldots, f_k\}$ is a function set with $k$
functions and $T = \{t_1, t_2, \ldots, t_l\}$ is a terminal set with
$l$ terminals. Since $F \cap T = \{\}$, $z = k + l$ holds. Programs
are encoded in trees. Each node of the code
tree contains an instruction $I$ and can have several
son nodes whose instructions are viewed as arguments
of $I$. To allow for HIs we partition $S$ into $m$ disjoint,
non-empty instruction sets $S^0, S^1, \ldots, S^m$, and
ensure that all “terminal instructions” - instructions
with zero arguments - are in $S^0$. Hierarchical order
arises as follows: Each argument of an instruction in
$S^v$ is in $S^v$ or in the “lower level” set $S^{v-1}$. At least
one argument must be in $S^{v-1}$, except when $v = 0$.
To allow for enforcing descents in the instruction set
hierarchy we add “level down” instructions $\downarrow_i$ to all
instruction sets $S^v$ ($0 < v \le m$), where $0 \le i < l(v)$
is the argument index of an instruction $I \in S^v$ with
$l(v)$ arguments from $S^{v-1}$. Although “level downs”
take a single argument and return it, they are treated
as terminal symbols. Thus each instruction set $S^v$
($0 < v \le m$) can be written as $F^v \cup T^v$, where
$F^v = \{f^v_1, f^v_2, \ldots, f^v_{k(v)}\}$ is a function set with $k(v)$
functions and $T^v = \{\downarrow_0, \downarrow_1, \ldots, \downarrow_{l(v)-1}\}$ is a terminal
set containing $l(v)$ “level down” instructions. We also
have $S^0 = F^0 \cup T^0$, where $F^0 = \{f^0_1, f^0_2, \ldots, f^0_{k(0)}\}$ is
a function set with $k(0)$ functions and $T^0 = T$ is a
terminal set containing all terminals of $S$ ($l(0) = l$).

To solve a one-dimensional function approximation
task one might use $F = \{+, -, *, \%, \sin, \cos, \exp, rlog\}$
and $T = \{x, R\}$, where $\%$ denotes protected division
($\forall y, u \in \mathbb{R}, u \neq 0$: $y \% u = y / u$ and
$y \% 0 = 1$); $rlog$ denotes protected logarithm
($\forall y \in \mathbb{R}, y \neq 0$: $rlog(y) = \log(abs(y))$ and $rlog(0) = 0$);
$x$ is an input variable; and $R$ is a generic random
constant in [0;1) (see below). To structure
this function approximation task as a linear combination
of non-linear parts we split the instruction set
$S = \{+, -, *, \%, \sin, \cos, \exp, rlog, x, R\}$ into
$S^0 = \{*, \%, \sin, \cos, \exp, rlog, x, R\}$ and $S^1 = \{+, -\}$. We
then add a $\downarrow_0$ instruction to $S^1$ and obtain
$S^1 = \{+, -, \downarrow_0\}$. Function and terminal sets for the lower
and upper level then become $F^0 = \{*, \%, \sin, \cos, \exp, rlog\}$,
$T^0 = \{x, R\}$ and $F^1 = \{+, -\}$, $T^1 = \{\downarrow_0\}$,
respectively. Figure 1 shows an example program.
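
The protected operators and the two-level split of the instruction set can be written down directly. The sketch below follows the definitions in the text; the names `pdiv`, `S0`, `S1` and the string label `"down0"` for the level-down instruction are assumptions chosen for the example.

```python
import math

def pdiv(y, u):
    """Protected division: y % u = y / u for u != 0, and y % 0 = 1."""
    return y / u if u != 0 else 1.0

def rlog(y):
    """Protected logarithm: log(|y|) for y != 0, and rlog(0) = 0."""
    return math.log(abs(y)) if y != 0 else 0.0

# Lower level: non-linear parts plus all terminals (x and the constant R).
S0 = {"*", "%", "sin", "cos", "exp", "rlog", "x", "R"}
# Upper level: the linear combination, plus one level-down instruction.
S1 = {"+", "-", "down0"}

# The partition must be disjoint.
assert S0.isdisjoint(S1)
```

Keeping `+` and `-` only in the upper set is what restricts every generated program to a linear combination of non-linear parts.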

Generic  Random  Constants.  A  generic  random 
constant  (GRC)  (compare  also  “ephemeral  random 
constant”  (Koza,  1992))  is  a  zero  argument  function 
(a  terminal).  When  accessed  during  program  creation, 
it  is  either  instantiated  to  a  random  value  from  a  pre¬ 
defined,  problem-dependent  set  of  constants  or  a  value 
previously  stored  together  with  the  probability  distri¬ 
bution  (see  below). 

Program Representation. With HIs the arity $n(v)$
of a program tree may vary depending on the hierarchical
level $v$. On each level $v$, $n(v)$ is the maximal
number of function arguments required by functions
in $S^v$. For instance, in the function approximation
example above, if we add to $S^0$ a three argument function,
e.g. $**$, where $**(a_1, a_2, a_3) = a_1 * a_2 * a_3$, then
the lower-level part of the program tree will be 3-ary
while the top-level part will remain 2-ary, as depicted
in Figure 2.

Figure 1: $f(x) = x * \sin(x) + \exp(\cos(0.2)) + x \% 0.1 - (x + rlog(x))$.
Exemplary program tree for function approximation
constrained to a linear combination of non-linear
parts. Top-level structuring instructions from
$S^1$ appear in boldface.

Figure 2: $f(x) = 0.7 * x * \sin(x) + 0.2 \% (x * x * x) - x$.
Exemplary program tree for function approximation, with
different level-dependent arities. Top-level program
parts are 2-ary. Lower level program parts are 3-ary.

Probability Distribution. The probability distribution
is stored in a “hierarchical probabilistic
prototype tree” (H-PPT). At each hierarchical level
$v$ ($0 \le v \le m$) the H-PPT generally contains infinite
$n(v)$-ary subtrees $PPT^{dw(v)}$, where the list

$dw(v) = ((d_{v+1}, w_{v+1}), (d_{v+2}, w_{v+2}), \ldots, (d_m, w_m))$

describes the absolute position of a subtree: it contains
0 to $m - 1$ components depending on the hierarchical
position $v$ of $PPT^{dw(v)}$ (0 components, if
$v = m$). Each component pair $(d_i, w_i)$ describes the
position of a higher level node $N^{dw(i)}_{d_i,w_i}$ in $PPT^{dw(i)}$ to
which $PPT^{dw(i-1)}$ is attached. The position of a node
$N^{dw(i)}_{d_i,w_i}$ inside a subtree $PPT^{dw(i)}$ is defined by
depth $d_i \ge 0$ ($PPT^{dw(i)}$’s root node has $d_i = 0$) and its
horizontal position $w_i$ when subtree nodes with equal
depth are read from left to right ($0 \le w_i < n(i)^{d_i}$).
Each node $N^{dw(v)}_{d_v,w_v}$ contains a variable probability
vector $P^{dw(v)}_{d_v,w_v}$. In addition, each node $N^{dw(0)}_{d_0,w_0}$ contains
a random constant $R^{dw(0)}_{d_0,w_0}$. The probability vectors
$P^{dw(v)}_{d_v,w_v}$, $\forall v: 0 \le v \le m$, have $k(v) + l(v)$ components.
Each component $P^{dw(v)}_{d_v,w_v}(I)$, $\forall v: 0 \le v \le m$,
denotes the probability of choosing instruction $I \in S^v$
at $N^{dw(v)}_{d_v,w_v}$. We maintain $\sum_{I \in S^v} P^{dw(v)}_{d_v,w_v}(I) = 1$.


H-PPT Initialization. Each H-PPT node $N^{dw(v)}_{d_v,w_v}$
requires an initial probability $P^{dw(v)}_{d_v,w_v}(I)$ for each
instruction $I \in S^v$. Furthermore, each bottom level
($v = 0$) node $N^{dw(0)}_{d_0,w_0}$ requires an initial random
constant $R^{dw(0)}_{d_0,w_0}$. We pick $R^{dw(0)}_{d_0,w_0}$ uniformly random in the
interval [0;1). To initialize instruction probabilities we
use for each hierarchical level $v$ a constant probability
$P_{T^v}$ for selecting an instruction from $T^v$ and $(1 - P_{T^v})$
for selecting an instruction from $F^v$. $P^{dw(v)}_{d_v,w_v}$ is then
initialized as follows:

$$P^{dw(v)}_{d_v,w_v}(I) := \frac{P_{T^v}}{l(v)}, \quad \forall I: I \in T^v$$

and

$$P^{dw(v)}_{d_v,w_v}(I) := \frac{1 - P_{T^v}}{k(v)}, \quad \forall I: I \in F^v$$
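
The two initialisation rules above can be sketched for a single node as follows. This is an illustrative sketch only: the function name `init_node`, the dictionary layout, and in particular the value 0.8 used for the terminal probability are assumptions made for the example, not values taken from this section.

```python
import random

def init_node(functions, terminals, p_t, rng, bottom_level):
    """Initial probability vector (and random constant) for one H-PPT node.

    Terminals share probability mass p_t equally, functions share 1 - p_t.
    """
    k, l = len(functions), len(terminals)
    probs = {t: p_t / l for t in terminals}
    probs.update({f: (1.0 - p_t) / k for f in functions})
    node = {"probs": probs}
    if bottom_level:
        # Only bottom-level (v = 0) nodes carry a random constant in [0, 1).
        node["R"] = rng.random()
    return node

rng = random.Random(0)
P_T = 0.8  # hypothetical value, chosen only for illustration
node0 = init_node(["*", "%", "sin", "cos", "exp", "rlog"], ["x", "R"],
                  P_T, rng, bottom_level=True)
node1 = init_node(["+", "-"], ["down0"], P_T, rng, bottom_level=False)
```

By construction each vector sums to one, which is the invariant the text requires of every node.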


Program Generation. Program generation in
H-PIPE is analogous to program generation in PIPE (see
Salustowicz and Schmidhuber, 1997), except that
instructions are selected from the appropriate $S^v$,
depending on the hierarchical level. To generate a
program PROG from H-PPT, an instruction $I \in S^v$ is
selected with probability $P^{dw(v)}_{d_v,w_v}(I)$ for each accessed
node $N^{dw(v)}_{d_v,w_v}$ of H-PPT. This instruction is denoted by
$I^{dw(v)}_{d_v,w_v}$. Nodes are accessed in a depth-first way,
starting at the root node $N_{0,0}$, and traversing H-PPT from
left to right. Figure 3 shows a H-PPT and a
corresponding possible program.
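
A simplified sketch of level-aware, depth-first program generation is given below for the two-level function approximation example. It is not the H-PPT itself: for brevity one shared probability vector per level replaces the per-node vectors, and the names `LEVELS`, `ARITY`, `generate`, the label `"down0"`, the value of `P_T` and the `max_depth` cut-off are all assumptions introduced for the example.

```python
import random

ARITY = {"+": 2, "-": 2, "*": 2, "%": 2,
         "sin": 1, "cos": 1, "exp": 1, "rlog": 1}
LEVELS = {
    1: {"F": ["+", "-"], "T": ["down0"]},
    0: {"F": ["*", "%", "sin", "cos", "exp", "rlog"], "T": ["x", "R"]},
}
P_T = 0.8  # hypothetical terminal probability, for illustration only

def level_probs(level):
    """Uniform-within-class probabilities, as in the initialisation rule."""
    f, t = LEVELS[level]["F"], LEVELS[level]["T"]
    probs = {i: P_T / len(t) for i in t}
    probs.update({i: (1.0 - P_T) / len(f) for i in f})
    return probs

def generate(level, depth, rng, max_depth=4):
    """Sample a program tree depth-first, left to right."""
    if depth >= max_depth:
        # Hypothetical cut-off: force a terminal so the sketch always halts.
        instr = rng.choice(LEVELS[level]["T"])
    else:
        probs = level_probs(level)
        instr = rng.choices(list(probs), weights=list(probs.values()))[0]
    if instr == "down0":
        # Level down: its single argument is generated one level lower.
        return ("down0", generate(level - 1, 0, rng, max_depth))
    if instr == "R":
        return ("R", rng.random())  # instantiate the random constant
    if instr == "x":
        return ("x",)
    args = tuple(generate(level, depth + 1, rng, max_depth)
                 for _ in range(ARITY[instr]))
    return (instr,) + args

program = generate(1, 0, random.Random(1))
```

Because upper-level nodes can only reach lower-level instructions through a level-down node, every sampled tree has the upper-level operators at the top and the non-linear parts below them.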

Tree Shaping. To reduce memory requirements and
allow for discarding elements of the probability
distribution that have become irrelevant over time, the
H-PPT is incrementally grown and pruned just like
PIPE’s probability tree (see Salustowicz and
Schmidhuber, 1997).

Update Rules. H-PIPE’s update rules are analogous
to PIPE’s (see Salustowicz and Schmidhuber, 1997).
The only difference is the more sophisticated indexing
method due to the H-PPT’s hierarchical structure.


2.2  Skip  Nodes  (SNs) 

Overview.  Skip  nodes  are  functions  that  serve  to 
switch  code  parts  on  and  off.  We  will  first  define  SNs 
for  PIPE,  then  for  H-PIPE. 

SNs for PIPE. PIPE’s probability distribution is
stored in a probabilistic prototype tree (PPT - see
Salustowicz and Schmidhuber (1997) for details). Let
$n$ denote the maximal arity of the PPT (the maximal
number of arguments of functions that are not SNs).
There are at most $n$ SNs. The $i$-th is denoted $\rightarrow_i$. It
is a function with $n$ arguments and returns the $i$-th.
Its interpretation is: evaluate the $i$-th argument but
ignore the others.
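
A small evaluator makes the behaviour of skip nodes concrete. The sketch below is illustrative only: the tuple encoding of program trees, the labels `"skip0"` and `"skip1"`, the function names, and the decision to count the skip node itself towards program size are assumptions made for the example.

```python
import math

OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "sin": math.sin, "cos": math.cos}

def evaluate(tree, x):
    """Evaluate a program tree; a skip node evaluates only its i-th argument."""
    op, *args = tree
    if op == "x":
        return x
    if op == "const":
        return args[0]
    if op.startswith("skip"):
        return evaluate(args[int(op[4:])], x)  # other arguments are ignored
    return OPS[op](*(evaluate(a, x) for a in args))

def coding_size(tree):
    """Number of nodes outside the non-coding segments created by skip nodes."""
    op, *args = tree
    if op in ("x", "const"):
        return 1
    if op.startswith("skip"):
        return 1 + coding_size(args[int(op[4:])])
    return 1 + sum(coding_size(a) for a in args)

# skip1 returns its second argument; sin(x) is a dormant, non-coding segment.
prog = ("skip1", ("sin", ("x",)), ("+", ("x",), ("const", 1.0)))
value = evaluate(prog, 2.0)
size = coding_size(prog)
```

The ignored subtree stays in the program and can become active again if a later version selects a different skip node at that position.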

SNs are elements of the function set $F$. For instance,
if we add SNs to the instruction set of the function
approximation example from Section 2.1 we obtain:
$F = \{+, -, *, \%, \sin, \cos, \exp, rlog, \rightarrow_0, \rightarrow_1\}$ and
$T = \{x, R\}$. Figure 4 shows an unstructured PIPE program
with SNs. The dashed parts of the program can be
viewed as non-coding segments. Note that they need
not even be created during program generation and
are therefore computationally cheap.

Figure 3: A H-PPT (left) and a corresponding possible program (right).
The structuring parts of the program are highlighted.

Figure 4: A PIPE program with SNs for function
approximation: $f(x) = (0.11 + x) * (0.2 - x)$. The dashed
parts of the program are non-coding segments.

Figure 5: A H-PIPE program with SNs for function
approximation: $f(x) = \exp(\cos(0.2)) + x + rlog(x)$.
The dashed parts of the program are non-coding segments.

SNs for H-PIPE. Let $h(v)$ denote the maximal number
of arguments of non-SN functions in $S^v$. At level
$v$ ($0 < v \le m$) there are at most $h(v)$ SNs. The $i$-th is
denoted $\rightarrow_i$. It is a function with $h(v)$ arguments and
returns the $i$-th. Its interpretation is: evaluate the $i$-th
argument but ignore the others. There are no SNs in
$S^0$.

SNs are elements of the function set $F^v$. For instance,
if we add SNs to the instruction set of the
function approximation example from Section 2.1 we
obtain: $F^0 = \{*, \%, \sin, \cos, \exp, rlog\}$, $T^0 = \{x, R\}$
and $F^1 = \{+, -, \rightarrow_0, \rightarrow_1\}$, $T^1 = \{\downarrow_0\}$. Figure 5 shows a
H-PIPE program with SNs.

Changes to PIPE’s and H-PIPE’s Update
Rules.

(1) Parts of PPT or H-PPT corresponding to non-coding
segments are not updated. (2) To mutate probabilities
we calculate program size $|PROG_b|$. With
SNs $|PROG_b|$ denotes the number of nodes in program
$PROG_b$ without the non-coding segments created by
SNs. See Salustowicz and Schmidhuber (1997) for details.


3  Experiments 

To evaluate the impact of HIs and SNs we cross-compare:
(1) PIPE, (2) H-PIPE without SNs
(H-PIPE-NO-SN), (3) PIPE with SNs (PIPE-SN), and (4)
H-PIPE (PIPE with HIs and SNs in the structuring
program parts). To illustrate the significance of appropriate
initial bias we also test H-PIPE with different
structuring instructions (H-PIPE-DIFF). We consider
a nontrivial continuous function regression problem
and the 6-bit parity problem, a discrete task involving
just 65 distinct fitness values. For each combination
of learning algorithm and problem we conduct 50-200
independent runs to obtain statistically significant
results.

3.1  Function  Regression 

The function to be approximated is plotted in Figure 6.
The training data set $D_{tr}$ samples $f$ at 101 equidistant
points in the interval [0;10]. $D_{tr}$ is used to calculate
fitness values during program evolution. Thus, the
fitness value of each program PROG is
$FIT(PROG) = \sum_{x \in D_{tr}} |f(x) - PROG(x)|$, where $PROG(x)$ denotes
the result of applying PROG to data $x$.
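
The fitness measure is a sum of absolute errors over the training grid and can be sketched in a few lines. The function names below are assumptions, and `math.sin` is used purely as a hypothetical stand-in for the target function plotted in Figure 6.

```python
import math

def training_points(lo=0.0, hi=10.0, n=101):
    """n equidistant sample points in [lo, hi], both end points included."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def fitness(prog, target, points):
    """Sum of absolute errors; smaller is better, 0 is a perfect fit."""
    return sum(abs(target(x) - prog(x)) for x in points)

points = training_points()
target = math.sin                      # hypothetical stand-in target
perfect = fitness(math.sin, target, points)
offset = fitness(lambda x: math.sin(x) + 0.5, target, points)
```

A program that is off by a constant 0.5 everywhere scores about 50.5 on this grid, which gives a feel for the scale of the fitness values reported below.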

Set-up. We time-constrain all runs to PE = 100,000
and use the following parameter setting empirically
found to work well: $P_T = P_{T^0} = P_{T^1}$ = [illegible], $\epsilon = 0.000001$,
$P_{el} = 0.01$, $PS = 10$, $lr = 0.01$, $P_M = 0.4$, $mr = 0.4$, $T_R = 0.3$,
$T_P = 0.999999$, $FIT_s = 0$ (see Salustowicz and Schmidhuber
(1997) for detailed description of parameters).
We use the following instruction sets:

(1) PIPE: $F = \{+, -, *, \%, \sin, \cos, \exp, rlog\}$, $T = \{x, R\}$;

(2) H-PIPE-NO-SN: $F^1 = \{+, -\}$, $T^1 = \{\downarrow_0\}$,
$F^0 = \{*, \%, \sin, \cos, \exp, rlog\}$, $T^0 = \{x, R\}$;

(3) PIPE-SN: $F = \{+, -, *, \%, \sin, \cos, \exp, rlog, \rightarrow_0, \rightarrow_1\}$,
$T = \{x, R\}$;

(4) H-PIPE: $F^1 = \{+, -, \rightarrow_0, \rightarrow_1\}$, $T^1 = \{\downarrow_0\}$,
$F^0 = \{*, \%, \sin, \cos, \exp, rlog\}$, $T^0 = \{x, R\}$;

(5) H-PIPE-DIFF: $F^1 = \{*, \%, \rightarrow_0, \rightarrow_1\}$, $T^1 = \{\downarrow_0\}$,
$F^0 = \{+, -, \sin, \cos, \exp, rlog\}$, $T^0 = \{x, R\}$.

Figure 6: $f(x) = x^3 \cdot e^{-x} \cdot \cos(x) \cdot \sin(x) \cdot (\sin^2(x) \cdot \cos(x) - 1)$

Results. Figure 7 summarizes all results in form
of cumulative histograms. We plot performance $u$
against percentage of programs with $FIT(PROG) \le u$.
Each point indicates the number of programs with
$FIT(PROG)$ equal to or better than its x-axis value:
algorithms with better performance have more points
with smaller x-values.

PIPE vs. H-PIPE. H-PIPE outperforms PIPE.
H-PIPE’s fitness in the median run is $FIT_{med} = 2.39$,
slightly better than PIPE’s with $FIT_{med} = 2.55$. In
82% of all runs H-PIPE finds programs with fitness
below 4, while only 67% of all PIPE runs accomplish
this. On the other hand, the worst 3% of all H-PIPE
runs resulted in programs worse than the best found
by all PIPE runs. The median of H-PIPE’s program
size ($Node_{med} = 92$ nodes) is significantly smaller than
PIPE’s ($Node_{med} = 157$).

How much of the performance improvement can be
attributed to HIs, how much to SNs? To study this
question we now compare PIPE and H-PIPE to PIPE
with SNs (PIPE-SN) and H-PIPE without SNs
(H-PIPE-NO-SN).

PIPE & H-PIPE vs. PIPE-SN. PIPE-SN performs
much like PIPE, and worse than H-PIPE.
PIPE-SN’s $FIT_{med} = 2.70$ is slightly higher than PIPE’s
($FIT_{med} = 2.55$). Like PIPE, in 67% of all runs
PIPE-SN found programs with fitness below 4. Its worst
programs are slightly better than the worst program
among the best of the individual PIPE runs.
PIPE-SN’s programs ($Node_{med} = 117$) tend to be smaller
than PIPE’s ($Node_{med} = 157$), but larger than
H-PIPE’s ($Node_{med} = 92$).

We  observe  that  SNs  in  unstructured  PIPE  programs 
are  neither  harmful  nor  beneficial. 

PIPE & H-PIPE vs. H-PIPE-NO-SN.
H-PIPE-NO-SN is the best competitor, slightly better than
H-PIPE, much better than PIPE. H-PIPE-NO-SN’s
$FIT_{med} = 2.38$ is roughly as good as H-PIPE’s
$FIT_{med} = 2.39$. In 91% of all runs, however,
H-PIPE-NO-SN found programs with fitness below 4, compared
to H-PIPE’s 82% and PIPE’s 67%. Furthermore,
unlike with H-PIPE and PIPE, no program found by
H-PIPE-NO-SN has fitness above 7.39. The median
size of H-PIPE-NO-SN programs, $Node_{med} = 96$, is
roughly the same as H-PIPE’s ($Node_{med} = 92$) and
significantly smaller than PIPE’s ($Node_{med} = 157$).

We observe that HIs by themselves increase PIPE’s
performance. Later (in Section 3.2) we will see that
both HIs and SNs are sometimes needed to solve certain
tasks more efficiently. But first we will illustrate
the importance of choosing the right HIs.

PIPE & H-PIPE vs. H-PIPE-DIFF. H-PIPE-DIFF performs
significantly worse than H-PIPE and PIPE. The fitness
of the best program found by H-PIPE-DIFF in 50
independent runs is only 7.52. H-PIPE-DIFF’s median
fitness is FITmed = 10.62, compared with 2.39 for
H-PIPE and 2.55 for PIPE.

This demonstrates, not unexpectedly, that appropriate
initial bias due to “good” HIs is crucial to H-PIPE’s
success.

Conclusion. HIs can increase PIPE’s performance
significantly. They need to be selected carefully,
however. SNs do not contribute much to solving the
function regression task. In the case of PIPE they
reduce program size without affecting solution
quality. In the case of H-PIPE they have a slightly
detrimental effect on overall performance.

The next experiment will show that for some tasks only
the combination of HIs and SNs leads to significant
performance improvement.
Figure 7: Results for the regression problem.


3.2  6-Bit  Parity 

The 6-bit parity function has six Boolean arguments
represented by integers: 1 for true and 0 for false. It
returns 1 if the number of nonzero arguments is odd
and 0 otherwise. The fitness of a program is the number
of patterns it classifies incorrectly. The best (worst)
fitness, for classifying all (no) patterns correctly,
is 0 (64). We use all 64 patterns for training.

Set-up. We time-constrain all runs to PE = 500,000
and use the following parameter settings, empirically
found to work well: $P_T = P_T^o = P_T^i = 0.6$,
$\varepsilon = 0.000001$, $P_{el} = 0.01$, $PS = 10$,
$lr = 0.01$, $P_M = 0.4$, $mr = 0.4$, $T_R = 0.3$,
$T_P = 0.999999$, $FIT_s = 0$ (see Salustowicz and
Schmidhuber (1997) for a detailed description of the
parameters). Note that, except for $P_T$, $P_T^o$, and
$P_T^i$, all parameters are set to the same values as
for the function regression task (see Section 3.1).
Most of PIPE’s and H-PIPE’s parameters seem robust with
respect to changing tasks. We use the following instruction
sets: (1) PIPE: $F = \{+, -, *, \%, \sin, \cos, \exp, \mathrm{rlog}\}$,
$T = \{x_0, x_1, x_2, x_3, x_4, x_5, R\}$;
(2) H-PIPE-NO-SN: $F^i = \{*, \%\}$, $T^i = \{i_0\}$,
$F^o = \{+, -, \sin, \cos, \exp, \mathrm{rlog}\}$,
$T^o = \{x_0, x_1, x_2, x_3, x_4, x_5, R\}$;
(3)  PIPE-SN:  P  =  %,sin,cos,exp,rlog,—*o 

T  -  {xo,xi,X2,X3,X4,X5,Ry,  (4)  H-PIPE: 

pi  =  {*,%,^i},  Ti  =  {io},  P“  =  {+, 

-,sin,cos,exp,rlog],  T°  =  {xo,xi,X2,X3,X4,X5,R}; 
(5)  H-PIPE-DIFF:  pi  =  T*  =  {io 

}, $F^o = \{*, \%, \sin, \cos, \exp, \mathrm{rlog}\}$,
$T^o = \{x_0, x_1, x_2, x_3, x_4, x_5, R\}$. To fit the
Boolean nature of the problem, the real-valued output of
a program is mapped to 0 if negative and to 1 otherwise.
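
The fitness measure and the output mapping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the two hand-written candidate programs are hypothetical and are not programs found by PIPE or H-PIPE.

```python
from itertools import product


def parity(bits):
    """Target function: 1 if the number of nonzero arguments is odd, else 0."""
    return sum(1 for b in bits if b != 0) % 2


def to_boolean(output):
    """Map a real-valued program output to 0 if negative and to 1 otherwise."""
    return 0 if output < 0 else 1


def fitness(program, n_bits=6):
    """Number of misclassified patterns over all 2**n_bits inputs (0 is best)."""
    return sum(
        to_boolean(program(*bits)) != parity(bits)
        for bits in product((0, 1), repeat=n_bits)
    )


# Hypothetical candidate programs, written by hand for illustration only.
def always_true(*x):
    return 1.0


def perfect(*x):
    # Non-negative exactly when the number of ones is odd.
    return 0.5 - abs(sum(x) % 2 - 1)


print(fitness(always_true))  # wrong on the 32 patterns with even parity
print(fitness(perfect))      # classifies all 64 patterns correctly
```

A constant program is therefore already halfway between the best (0) and worst (64) fitness, which is why the percentage of perfect solutions is the more informative figure for this task.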

Results. Table 1 summarizes all results. The first
column displays for each algorithm the percentage of
independent runs leading to perfect solutions within
the given time frame (PE). The next three columns
show the numbers of program evaluations necessary
to find perfect solutions in the shortest, median, and
longest run, respectively. The final three columns list
the minimal, median, and maximal program sizes
embodying perfect solutions.
Table 1: Summary of 6-bit parity results. Best values are in boldface.

6-bit parity

Algorithm       solved   Program Evaluations           Nodes
                         min - med - max               min - med - max
H-PIPE          94 %     5,700 - 37,460 - 397,000      23 - 61 - 96
PIPE            79 %     3,520 - 79,950 - 497,220      24 - 64 - 137
PIPE-SN         76 %     1,676 - 73,720 - 487,930      25 - 58 - 110
H-PIPE-NO-SN    66 %     3,720 - 166,740 - 468,950     21 - 49 - 85
H-PIPE-DIFF     28 %     38,300 - 216,570 - 457,330    24 - 61 - 94

Comparison. H-PIPE performs best. It solves the task
more often and significantly faster (with fewer program
evaluations) than PIPE, PIPE with SNs, and H-PIPE
without SNs. PIPE and PIPE-SN have roughly the same
performance. PIPE-SN finds slightly fewer solutions,
but is faster than PIPE in the median run. The median
size of its solutions is also slightly smaller than
PIPE’s. Although its solution size is smallest in the
median run, H-PIPE-NO-SN performs significantly worse
than PIPE and PIPE-SN. It finds fewer solutions and
requires more than twice as many program evaluations
(in the median run). H-PIPE-DIFF, with the wrong
initial bias, is worst of all. It needs more than five
times as many program evaluations as H-PIPE to find
roughly one third as many solutions.
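
The ratios quoted in this comparison can be checked directly against Table 1. The short Python sketch below does the arithmetic; the dictionary and function names are our own, and the values are transcribed from the table as printed.

```python
# Median program evaluations and percentage of runs solved, from Table 1.
MEDIAN_EVALS = {
    "H-PIPE": 37_460,
    "PIPE": 79_950,
    "PIPE-SN": 73_720,
    "H-PIPE-NO-SN": 166_740,
    "H-PIPE-DIFF": 216_570,
}
SOLVED_PCT = {
    "H-PIPE": 94,
    "PIPE": 79,
    "PIPE-SN": 76,
    "H-PIPE-NO-SN": 66,
    "H-PIPE-DIFF": 28,
}


def ratio(table, a, b):
    """How many times larger the entry for a is than the entry for b."""
    return table[a] / table[b]


# H-PIPE-NO-SN vs. PIPE: "more than twice as many program evaluations".
print(round(ratio(MEDIAN_EVALS, "H-PIPE-NO-SN", "PIPE"), 2))
# H-PIPE-DIFF vs. H-PIPE: "more than five times as many program evaluations".
print(round(ratio(MEDIAN_EVALS, "H-PIPE-DIFF", "H-PIPE"), 2))
# H-PIPE vs. H-PIPE-DIFF: H-PIPE solves roughly three times as many runs.
print(round(ratio(SOLVED_PCT, "H-PIPE", "H-PIPE-DIFF"), 2))
```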

Conclusion. With this particular task H-PIPE outperforms
PIPE. Neither SNs by themselves nor HIs by themselves
are able to improve PIPE’s performance. In the absence
of structure, SNs’ effects are neither harmful nor
beneficial, while HIs by themselves decrease PIPE’s
performance. The combination of both HIs (embodying
the proper initial bias) and SNs in H-PIPE, however,
allows for significant improvement.

4  Conclusion 

H-PIPE, a novel method for synthesizing structured
programs, uses hierarchical instructions (HIs) to
structure programs and skip nodes (SNs) to facilitate
their synthesis. HIs combine program parts, while SNs
allow for non-coding segments. In our experiments, HIs
by themselves sometimes worked extremely well, but
not always. When they did not, combining them with
SNs helped to achieve dramatic improvement. SNs
by themselves were useless for improving performance.
Our review of previous work on non-coding segments
suggests that non-coding segments seem to require
structured code to unfold their benefits. Our own
results add further empirical evidence in this vein.


Limitations and Future Work. HIs are chosen a
priori; currently there is no recipe for finding the
optimal ones. But it may be possible to automate
the HI selection process itself by making it subject to
data-driven evolutionary optimization.
Abstract

This paper presents results from the first attempt
to apply Transformation-Based Learning to a
discourse-level Natural Language Processing task.
To address two limitations of the standard
algorithm, we developed a Monte Carlo version of
Transformation-Based Learning to make the method
tractable for a wider range of problems without
degradation in accuracy, and we devised a committee
method for assigning confidence measures to tags
produced by Transformation-Based Learning. The
paper describes these advances, presents
experimental evidence that Transformation-Based
Learning is as effective as alternative approaches
(such as Decision Trees and N-Grams) for a discourse
task called Dialogue Act Tagging, and argues that
Transformation-Based Learning has desirable
features that make it particularly appealing
for the Dialogue Act Tagging task.


1  INTRODUCTION 

Transformation-Based Learning is a relatively new
machine learning method, which has been as effective
as any other approach on the Part-of-Speech
Tagging problem¹ (Brill, 1995a). We are utilizing
Transformation-Based Learning for another important
language task called Dialogue Act Tagging, in which
the goal is to label each utterance in a conversational
dialogue with the proper dialogue act. A dialogue act
is a concise abstraction of a speaker’s intention, such as
SUGGEST or ACCEPT. Recognizing dialogue acts is
critical for discourse-level understanding and can also
be useful for other applications, such as resolving
ambiguity in speech recognition. But computing dialogue
acts is a challenging task, because often a dialogue act
cannot be directly inferred from a literal reading of an
utterance. Figure 1 presents a hypothetical dialogue
that has been labeled with dialogue acts.

¹The goal of this Natural Language Processing task is
to label words with the proper part-of-speech tags, such as
Noun and Verb.

Our research efforts led us to address some limitations
of Transformation-Based Learning. We developed a
Monte Carlo version of the algorithm that overcomes
the limitation of Transformation-Based Learning’s
dependence on manually-generated rule templates and
enables Transformation-Based Learning to be applied
effectively to a wider range of tasks. We also devised
a technique that uses a committee of learned models
to derive confidence measures associated with the
dialogue acts assigned to utterances.

We experimentally compared our modified version of
Transformation-Based Learning with C5.0, an
implementation of Decision Trees, and N-Grams, which was
previously the best reported method for Dialogue Act
Tagging (Reithinger and Klesen, 1997). Our system
performs as well as these benchmarks, and we note
that Transformation-Based Learning has several
characteristics that make it particularly appealing for the
Dialogue Act Tagging task.

This paper begins with an overview of the
Transformation-Based Learning method, describing
the training phase and the application phase of the
algorithm and presenting some of Transformation-Based
Learning’s most attractive characteristics for Dialogue
Act Tagging. The following section describes the
experimental design used for the experiments presented
in the paper. Then Section 4 presents two limitations
of Transformation-Based Learning, a dependence on
rule templates and a lack of confidence measures,
and describes our solutions for these problems,
a Monte Carlo strategy and a committee method.
Next we present an experimental comparison between
Transformation-Based Learning, N-Grams, and Decision
Trees, and conclude with a discussion of this work.
#  Speaker  Utterance                                        Dialogue Act
1  John     Hello.                                           GREET
2  John     I’d like to meet with you on Tuesday at 2:00.    SUGGEST
3  Mary     That’s no good for me,                           REJECT
4  Mary     but I’m free at 3:00.                            SUGGEST
5  John     That sounds fine to me.                          ACCEPT
6  John     I’ll see you then.                               BYE

Figure 1: A sample dialogue


2  TRANSFORMATION-BASED 
LEARNING 

Brill (1995a) developed a symbolic machine learning
method called Transformation-Based Learning.
Given a tagged training corpus, Transformation-Based
Learning produces a sequence of rules that serves as a
model of the training data. Then, to derive the
appropriate tags, each rule may be applied, in order,
to each instance in an untagged corpus. For all of
the results and examples in this paper, we are using
Transformation-Based Learning on the Dialogue Act
Tagging task, so the instances are utterances and the
tags are dialogue acts. In one experiment, our system
produced a learned model with 213 rules; the first five
rules are presented in Figure 2.


#  Condition(s)                                 New Dialogue Act
1  none                                         SUGGEST
2  Includes “see” and “you”                     BYE
3  Includes “sounds”                            ACCEPT
4  Length < 4 words, Previous tag is none²      GREET
5  Includes “no”, Previous tag is SUGGEST       REJECT

Figure 2: Rules produced by Transformation-Based
Learning for Dialogue Act Tagging


2.1  THE  TRAINING  PHASE 

The  training  phase  of  TBL,  in  which  the  system  learns 
a  sequence  of  rules  based  on  a  tagged  training  corpus, 
proceeds  in  the  following  manner: 

1. Label each instance with a dummy tag.

2. Until no useful rules are found,

   a. For each incorrect tag

      i. Generate all rules that correct the tag.

   b. Score each generated rule.

   c. Output the highest scoring rule.

   d. Apply this rule to the corpus.

²This condition is true only for the first utterance of a
dialogue.


First, the system initializes the training corpus by
labeling each instance with a dummy tag. Brill (1995a)
suggested using a more complex initialization step, but
we found that this simple strategy is more effective in
practice.³ Then the system generates all of the potential
rules that would make at least one tag in the training
corpus correct, under the restrictions described below.
For each potential rule, its improvement score is
defined to be the number of correct tags in the training
corpus after applying the rule minus the number of
correct tags in the training corpus before applying the
rule. The potential rule with the highest improvement
score is output as the next rule in the final model and
applied to the entire training corpus. This process
repeats (using the updated tags on the training corpus),
producing one rule for each pass through the training
corpus until no rule can be found with an improvement
score that surpasses some predefined threshold.
In practice, threshold values of 1 or 2 appear to be
effective.

Since there are potentially an infinite number of rules
that could produce the tags in the training data, it is
necessary to restrict the range of patterns that the
system may consider by providing a set of rule templates,
such as:

IF utterance u contains the word(s) w
AND the tag on the utterance preceding u is X
THEN change u’s tag to Y

This template can be instantiated to produce the last
rule in Figure 2 by setting w=“no”, X=SUGGEST,
and Y=REJECT.
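
The training loop can be made concrete with a small Python sketch that uses only the single template shown above. It is an illustration under several assumptions of our own: rules are (word, previous tag, new tag) triples, a rule’s conditions are tested against the tags as they stood before that rule was applied, ties are broken by a fixed ordering, and the threshold is set to 0 because the toy corpus (the dialogue of Figure 1) is far too small for the values of 1 or 2 mentioned earlier.

```python
DUMMY = "DUMMY"


def previous_tags(tags):
    """Tag of the preceding utterance; None for the first utterance."""
    return [None] + list(tags[:-1])


def apply_rule(rule, corpus, tags):
    """Apply one (word, previous_tag, new_tag) rule to every utterance."""
    word, required_prev, new_tag = rule
    return [
        new_tag if word in words and prev == required_prev else tag
        for words, prev, tag in zip(corpus, previous_tags(tags), tags)
    ]


def count_correct(tags, gold):
    return sum(t == g for t, g in zip(tags, gold))


def improvement(rule, corpus, tags, gold):
    """Correct tags after applying the rule minus correct tags before."""
    after = apply_rule(rule, corpus, tags)
    return count_correct(after, gold) - count_correct(tags, gold)


def candidate_rules(corpus, tags, gold):
    """Instantiate the template for every incorrectly tagged utterance."""
    rules = set()
    for words, prev, tag, target in zip(corpus, previous_tags(tags), tags, gold):
        if tag != target:
            rules.update((w, prev, target) for w in words)
    return sorted(rules, key=repr)  # fixed order, so ties break deterministically


def train(corpus, gold, threshold=0):
    tags = [DUMMY] * len(corpus)  # step 1: label everything with a dummy tag
    model = []
    while True:
        rules = candidate_rules(corpus, tags, gold)
        if not rules:
            break
        best = max(rules, key=lambda r: improvement(r, corpus, tags, gold))
        if improvement(best, corpus, tags, gold) <= threshold:
            break
        model.append(best)                    # output the highest scoring rule
        tags = apply_rule(best, corpus, tags)  # and apply it to the corpus
    return model, tags


CORPUS = [
    ["hello"],
    ["i'd", "like", "to", "meet", "with", "you", "on", "tuesday", "at", "2:00"],
    ["that's", "no", "good", "for", "me"],
    ["but", "i'm", "free", "at", "3:00"],
    ["that", "sounds", "fine", "to", "me"],
    ["i'll", "see", "you", "then"],
]
GOLD = ["GREET", "SUGGEST", "REJECT", "SUGGEST", "ACCEPT", "BYE"]

model, tags = train(CORPUS, GOLD)
print(model[0])  # the first rule learned corrects both SUGGEST utterances
```

On this toy corpus the first rule found is the one that fixes two utterances at once; every later rule fixes a single utterance, which mirrors the general-to-specific progression of a learned rule sequence.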

For the first rules of the learned model, the emphasis
is on getting as many tags correct as possible, with
no penalty imposed for changing an incorrect tag to
another incorrect tag. Then for the later rules, the
system must avoid changing any of the tags that are
already correct. Thus, this method tends to produce
a sequence of rules that progresses from general rules
to specific rules.

³This is because Transformation-Based Learning uses
an error-driven approach, only generating rules for the
instances that are incorrectly labeled. If every instance is
initialized with a dummy tag, then all of the labels are
incorrect, and so they all contribute to learning.
Alternatively, using a more involved initialization step results
in a greater number of correct tags and, effectively, less
training data.

2.2  THE  APPLICATION  PHASE 

To see how a rule sequence can be used to label data,
consider applying the rules in Figure 2 to the dialogue
in Figure 1. The first rule labels every utterance with
the dialogue act SUGGEST. Next, the second rule
changes an utterance’s tag to BYE if it contains the
words “see” and “you”, which only holds for utterance
#6. Similarly, the third rule changes utterance #5’s
tag to ACCEPT. Then the fourth rule tags utterance
#1 as GREET, since its length is 1 and there is no
preceding utterance in the dialogue. And finally, the last
rule relabels utterance #3 as REJECT, since utterance
#2 is currently tagged SUGGEST, and the word
“no” is found in utterance #3. Although the first five
rules label these six utterances correctly, the remaining
208 rules in the sequence may continue to adjust
the tags on the utterances.
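
The walk-through above can be reproduced with a short Python sketch. The encoding of rules as (condition, new tag) pairs, the tokenizer, and the convention that each rule sees the tags left by the preceding rules are assumptions made for illustration; only the five rules and the six utterances come from Figures 1 and 2.

```python
def tokenize(utterance):
    """Lower-case the words and strip surrounding punctuation."""
    return [w.strip(".,!?").lower() for w in utterance.split()]


DIALOGUE = [
    "Hello.",
    "I'd like to meet with you on Tuesday at 2:00.",
    "That's no good for me,",
    "but I'm free at 3:00.",
    "That sounds fine to me.",
    "I'll see you then.",
]

# Each rule is (condition on (words, previous_tag), new dialogue act).
# previous_tag is None only for the first utterance of the dialogue.
RULES = [
    (lambda words, prev: True, "SUGGEST"),
    (lambda words, prev: "see" in words and "you" in words, "BYE"),
    (lambda words, prev: "sounds" in words, "ACCEPT"),
    (lambda words, prev: len(words) < 4 and prev is None, "GREET"),
    (lambda words, prev: "no" in words and prev == "SUGGEST", "REJECT"),
]


def tag_dialogue(utterances, rules):
    """Apply each rule, in order, to every utterance."""
    tokens = [tokenize(u) for u in utterances]
    tags = [None] * len(tokens)
    for condition, new_tag in rules:
        previous = [None] + tags[:-1]  # tags as left by the preceding rules
        tags = [
            new_tag if condition(words, prev) else tag
            for words, prev, tag in zip(tokens, previous, tags)
        ]
    return tags


print(tag_dialogue(DIALOGUE, RULES))
# ['GREET', 'SUGGEST', 'REJECT', 'SUGGEST', 'ACCEPT', 'BYE']
```

Rule order matters here: if the fifth rule ran before the first, utterance #2 would not yet be tagged SUGGEST and utterance #3 would never become REJECT.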

2.3  ATTRACTIVE  CHARACTERISTICS 

For the Dialogue Act Tagging task, we selected
Transformation-Based Learning for several reasons.
Brill reported that Transformation-Based Learning is
as good as or better than any other algorithm for the
Part-of-Speech Tagging problem, labeling 97.2% of the
words correctly. The part-of-speech tag of a word is
dependent on the word’s internal features and on the
surrounding words; similarly, the dialogue act of an
utterance is dependent on the utterance’s internal
features and on the surrounding utterances. This parallel
suggests that Transformation-Based Learning has
potential for success on the Dialogue Act Tagging
problem.

Since we currently lack a systematic theory of dialogue
acts, another reason that Transformation-Based
Learning is an attractive choice is that its learned
model consists of relatively intuitive rules (Brill,
1995a), which a human can analyze to determine what
the system has learned and develop a working theory.
Also, Transformation-Based Learning is good at
ignoring any potential rules that are irrelevant. This
is because irrelevant rules tend to have a random
effect on the training data, which usually results in
low improvement scores, so these rules are unlikely
to be selected for inclusion in the final model. This
is very helpful for Dialogue Act Tagging, since we
don’t know what the relevant templates are for this
problem. Ramshaw and Marcus (1994) experimentally
demonstrated Transformation-Based Learning’s
robustness with respect to irrelevant rules.

For these reasons, along with others that are presented
at the end of the paper, we believe that
Transformation-Based Learning is worthy of
investigation for the Dialogue Act Tagging task.

3  EXPERIMENTAL  DESIGN 

All of the results presented in this paper followed the
same experimental design as the third experiment in
Reithinger and Klesen (1997). The corpus consisted of
face-to-face appointment-scheduling dialogues in
English and was divided into a training set with 143
dialogues (2701 utterances) and a disjoint testing set
with 20 dialogues (328 utterances). Each utterance
was manually labeled with one of 18 abstract dialogue
acts, such as SUGGEST, ACCEPT, REJECT,
GREET, and BYE. The full list of dialogue acts is
found in Reithinger and Klesen (1997).

The Transformation-Based Learning experiments
presented in this paper were run on a Sun Ultra 1
machine with 508MB of main memory. Within a set of
experiments, only the specified parameters were varied,
but between sets of experiments many parameters
may have been varied, so it is not possible to draw
conclusions across experiment sets.

Our rule templates consist of all possible combinations
of a preselected set of conditions. Some of these
conditions are presented in Figure 3. Each condition
consists of a feature and a distance, where the feature
specifies a characteristic of utterances that might be
relevant for the Dialogue Act Tagging task, and the
distance specifies the relative position (from the
utterance under analysis) of the utterance that the feature
should be applied to.


Feature        Distance
length         of the current utterance
tag            of the preceding utterance
cue patterns   of the current utterance
speaker        of the current utterance
speaker        of the preceding utterance

Figure 3: Some conditions used in our experiments
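
Since a template is any combination of conditions, the template set is simply the power set of the condition set. The following Python sketch enumerates it for the five conditions of Figure 3; encoding a distance as 0 (current utterance) or -1 (preceding utterance) is our own illustrative convention.

```python
from itertools import combinations

# (feature, distance) pairs from Figure 3; distance 0 is the current
# utterance and -1 the preceding one (an encoding assumed for this sketch).
CONDITIONS = [
    ("length", 0),
    ("tag", -1),
    ("cue patterns", 0),
    ("speaker", 0),
    ("speaker", -1),
]


def all_templates(conditions):
    """Every subset of the conditions is one template."""
    return [
        subset
        for size in range(len(conditions) + 1)
        for subset in combinations(conditions, size)
    ]


templates = all_templates(CONDITIONS)
print(len(templates))  # 2**5 = 32, including the empty template
```

The empty template corresponds to an unconditional rule such as the first rule of Figure 2, and each additional condition doubles the number of templates.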

In discourse, it is widely acknowledged that some of
the short phrases (and specific words) found in an
utterance provide strong clues to determine the
appropriate dialogue act. Several researchers proposed
different cue phrases, which are phrases that appear
frequently in dialogue and convey useful discourse
information, such as “but”, “so”, and “by the way”.
Unfortunately, there is no universal agreement on which
phrases should be considered cue phrases, and in a
preliminary experiment using all of the cue phrases
proposed in the literature,⁴ our system’s accuracy only
improved by 1.03%.

⁴These lists of cue phrases can be found in Hirschberg

In order to identify the phrases that will be useful for a
particular domain, we need an automatic method for
collecting a set of phrases that is tuned to that domain.
So we are using a statistical approach to select
relevant cue patterns⁵ from a training corpus. Assuming
that a phrase is relevant if it co-occurs frequently
with a few specific dialogue acts, we analyze the
distribution of dialogue acts for utterances that include a
given phrase, selecting those phrases that correspond
to dialogue act distributions with low entropy. When
using these cue patterns, our system’s accuracy rose
by 17.63%. For more details on this work, see Samuel,
Carberry, and Vijay-Shanker (1998b).
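
The entropy criterion can be illustrated with a minimal Python sketch. It is not the procedure of Samuel, Carberry, and Vijay-Shanker (1998b): single words stand in for phrases, and the `max_entropy` and `min_count` thresholds, the function names, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def dialogue_act_entropy(acts):
    """Entropy (in bits) of the dialogue act distribution in `acts`."""
    total = len(acts)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(acts).values()
    )


def select_cue_patterns(tagged_utterances, max_entropy=0.5, min_count=2):
    """Keep words that occur often enough and mostly with the same acts."""
    acts_by_word = {}
    for words, act in tagged_utterances:
        for word in set(words):
            acts_by_word.setdefault(word, []).append(act)
    return sorted(
        word
        for word, acts in acts_by_word.items()
        if len(acts) >= min_count and dialogue_act_entropy(acts) <= max_entropy
    )


# Hypothetical toy corpus of (words, dialogue act) pairs.
TOY = [
    (["how", "about", "monday"], "SUGGEST"),
    (["how", "about", "noon"], "SUGGEST"),
    (["that", "sounds", "good"], "ACCEPT"),
    (["that", "is", "bad"], "REJECT"),
    (["sounds", "fine"], "ACCEPT"),
]

print(select_cue_patterns(TOY))  # ['about', 'how', 'sounds']
```

In the toy data, “that” is rejected because it is split evenly between ACCEPT and REJECT (entropy of 1 bit), while “sounds” always accompanies ACCEPT (entropy of 0).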

4  TRANSFORMATION-BASED 
LEARNING  IN  DISCOURSE 

4.1  TWO  LIMITATIONS 

Transformation-Based Learning has two serious
limitations, which we will address in this section. First,
although Transformation-Based Learning produces a
tag for each instance, it doesn’t offer any measure
of confidence in these tags. In contrast, probabilistic
machine learning approaches generally label an
instance with a set of tags, which are assigned numbers
to represent the likelihood that they are correct. So
“probabilistic methods ... provide a continuous ranking
of alternative analyses rather than just a single
output, and such rankings can productively increase
the bandwidth between components of a modular
system.” (Brill and Mooney, 1997)

The second limitation of Transformation-Based Learning
is that it is highly dependent on the rule templates,
which are manually developed in advance. Since the
omission of any relevant templates would handicap the
system, it is essential that these choices be made
carefully. But in Dialogue Act Tagging, no one knows
exactly which conditions and combinations of conditions
are relevant, so it is preferable to err on the side of
caution by constructing an overly-general set of templates
and allowing the system to learn which templates are
useful. As discussed earlier, Transformation-Based
Learning is capable of discarding irrelevant rules, so
this approach should be effective, in theory.

Unfortunately, this strategy is not tractable, because
for each pass through the training data, for each
instance that the system has tagged incorrectly, every
rule template must be instantiated in all possible ways.


and  Litman  (1993)  and  Knott  (1996). 

⁵In practice, the concept of cue patterns tends to
be more general than cue phrases, including many more
phrases.


Suppose that we can postulate $f$ different features that
might be relevant, and we wish to consider these features
for all instances that occur within a distance $d$ of a
given instance. (In other words, we are using a contextual
window of size $2d+1$.) Then there are $(2d+1)f$ conditions
and $2^{(2d+1)f}$ possible templates, since each condition
may either be included or excluded. Also, suppose that when
a feature is applied to an instance, it produces $v$ distinct
values, on average. This results in $(v+1)^{(2d+1)f}$ rules
for each instance, which can be proven by induction on the
number of conditions. Given a training corpus with $i$
instances, if the algorithm makes $p$ passes through the
training data, then the system must generate and evaluate
$O(ip(v+1)^{(2d+1)f})$ rules. Some realistic values for
these variables are $f=10$, $d=2$ (a contextual window
of size 5), $v=3$, $i=3000$, and $p=100$, which generates
around $10^{35}$ rules. Based on experimental evidence,
it appears that it is necessary to drastically limit the
number of potential rules that the system generates,⁶
or the memory and time costs are so exorbitant that
the method becomes intractable. But this limitation
would preclude considering all of the features and
feature interactions that might be relevant for Dialogue
Act Tagging.
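
The size of this search space is easy to check numerically. The Python sketch below evaluates the counts for the values quoted above, reading the bound as $ip(v+1)^{(2d+1)f}$ with the constant factor ignored; the function names are ours.

```python
def num_conditions(f, d):
    """f features applied at each of the 2d+1 positions in the window."""
    return (2 * d + 1) * f


def num_templates(f, d):
    """Each condition is either included in a template or excluded."""
    return 2 ** num_conditions(f, d)


def rules_per_instance(f, d, v):
    """(v+1) ** ((2d+1) * f), the per-instance count used in the bound."""
    return (v + 1) ** num_conditions(f, d)


def total_rules(f, d, v, i, p):
    """Rules generated over p passes through a corpus of i instances."""
    return i * p * rules_per_instance(f, d, v)


total = total_rules(f=10, d=2, v=3, i=3000, p=100)
print(num_conditions(10, 2))  # 50 conditions
print(f"{total:.1e}")         # about 3.8e+35 rules
```

Python integers have arbitrary precision, so the count is computed exactly before being formatted.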

4.2  A  MONTE  CARLO  VERSION 

We developed a Monte Carlo version of
Transformation-Based Learning, so that the system
can consider a huge number of templates while
still maintaining tractability. Rather than exhaustively
searching through the space of possible rules,
only R of the available template instantiations are
randomly selected for each training instance on each
pass through the training data, where R is some small
integer. With this modification, the total number
of rules generated is only O(ipR), which no longer
explodes with the number of templates. In fact,
the formula doesn’t even depend on the number of
features, the contextual window size, or the value of
v. But one would still expect good results, because
Transformation-Based Learning only needs to find the
best rules, and the best rules tend to be effective for
a large number of different instances. So the system
has many opportunities to find these rules, and since
the algorithm generally makes many passes through
the training data before halting, if it should select a
suboptimal rule, it can use later rules to compensate.
Thus, although random sampling will miss some rules,
it is still highly likely to find an effective sequence of
rules.
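
The sampling step can be sketched as follows. The sketch assumes that a random template is drawn by including each condition independently with probability 1/2 (that is, uniformly over all subsets of conditions) and that a rule is a pair of instantiated conditions and a new tag; the text does not specify the sampling distribution, so both choices, along with all names, are assumptions.

```python
import random


def sample_rules(instance_features, correct_tag, R, rng):
    """Draw R random template instantiations for one mis-tagged instance.

    instance_features maps each condition name to the value it takes on
    this instance. Repeated rules are not filtered out.
    """
    names = sorted(instance_features)
    rules = []
    for _ in range(R):
        # A random template: each condition is included or excluded.
        chosen = [name for name in names if rng.random() < 0.5]
        conditions = tuple((name, instance_features[name]) for name in chosen)
        rules.append((conditions, correct_tag))
    return rules


# Hypothetical feature values for utterance #3 of Figure 1.
features = {
    "length": 5,
    "previous tag": "SUGGEST",
    "cue pattern": "no",
    "speaker": "Mary",
}
rng = random.Random(0)  # seeded only to make the sketch reproducible
for rule in sample_rules(features, "REJECT", R=6, rng=rng):
    print(rule)
```

With i instances and p passes, at most i·p·R rules are generated in total, whatever the number of conditions.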

Our experiments confirm these intuitions, as shown
in Figures 4 and 5. For these runs, eight conditions
were preselected, and for different values of $n$,
$0 \le n \le 8$, the first $n$ conditions were combined in all
possible ways to generate $2^n$ templates. Using these
templates, we trained, tested, and compared the standard
Transformation-Based Learning method and our
Monte Carlo version of Transformation-Based Learning.

⁶For the Part-of-Speech Tagging task, Brill used only
about 30 simple rule templates (Brill, 1995a).

Figure 4: Number of conditions vs. training time

For the standard Transformation-Based Learning
method, training time rises dramatically as the number
of conditions increases, as shown in Figure 4.⁷
In fact, when given seven conditions, the standard
Transformation-Based Learning algorithm could not
complete the training phase, even after running for
more than 24 hours. But our Monte Carlo version
of Transformation-Based Learning keeps the efficiency
relatively stable.⁸ The reason for the slight increase in
training time as the number of conditions increases is
that, as the system gains access to a greater number
of useful conditions, it’s likely to find a greater number
of useful rules, meaning that the training phase
makes a greater number of passes through the training
data. Thus, p increases, and so the training time,
O(ipR), also increases. But this increase is linear (or
less), while standard Transformation-Based Learning’s
training time increases exponentially with the number
of conditions. Figure 4 supports this analysis.

⁷The value of v (the average number of rules generated
per instance) varies slightly across the eight conditions,
and so the shape of the curve might vary depending on
the order in which the conditions are presented. But the
critical point is that the training time rises exponentially
with the number of conditions.

⁸The Monte Carlo version of Transformation-Based
Learning can be slower than the standard method, because
the Monte Carlo version always generates R rules for each
instance, without checking for repetitions. (It would be too
inefficient to prevent the system from generating any rule
more than once.)

This  improvement  in  time  efficiency  would  be  quite  un¬ 
interesting  if  the  performance  of  the  algorithm  deteri¬ 
orated  significantly.  But,  as  Figure  5  shows,  this  is  not 
the  case.  Although  setting  R  too  low  (such  as  R=1  for 
7  and  8  conditions)  may  result  in  a  decrease  in  accu¬ 
racy,  the  lowest  possible  setting  (R=l)  is  as  accurate 
as  standard  Transformation-Based  Learning  for  6  con¬ 
ditions  (64  templates).  For  7  and  8  conditions,  train¬ 
ing  of  the  standard  Transformation-Based  Learning 
method  took  too  much  time,  so  those  results  could  not 
be  produced.  But,  as  the  curves  for  R=6  and  R=16  do 
not  differ  significantly,  it  is  reasonable  to  predict  that 
standard Transformation-Based Learning would produce
similar results as well.^9 Therefore, we conclude
that our Monte Carlo version of Transformation-Based
Learning (with R=6) works effectively for more than
250 templates (8 conditions) in only about 15 minutes
of training time.

^9 One might wonder how the Monte Carlo version of
Transformation-Based Learning can ever do better than
the standard Transformation-Based Learning method,
which occurred for the experiments that used five
conditions. Because Transformation-Based Learning is a
greedy algorithm, choosing the best available rule on each
pass through the training data, sometimes the standard
Transformation-Based Learning method selects a rule that
locks it into a local maximum, while the Monte Carlo
version might fail to consider this attractive rule and end up
producing a better model.

4.3 A COMMITTEE METHOD

We wanted to extend Transformation-Based Learning
so that it could provide some idea of the likelihood
that each of its tags is correct. So we attempted to
develop a strategy for assigning confidence measures
to the rules in the learned model. Then, in the application
phase, a given instance’s confidence measure
would be a function of the confidences of the rules that
applied to that instance. Unfortunately, due to the nature
of the Transformation-Based Learning method,
this straightforward approach has been unsuccessful,
because the rule sequence does not contain enough
information to derive confidence measures; often, the
same pattern of rules applies to instances that should
be marked with high confidence as well as instances
that should be marked with low confidence.

So, for the purpose of computing confidence measures,
we adapted two techniques that were developed for
very different tasks. The Boosting approach has been
used to improve accuracy in tagging data (Freund and
Schapire, 1996), and Committee-Based Sampling utilized
a very similar strategy to minimize the required
size of a training corpus (Dagan and Engelson, 1995).
We  applied  those  methods  to  compute  confidence  mea¬ 
sures,  by  training  the  system  a  number  of  times  to 
produce  a  few  different  but  reasonable  learned  models, 
which  are  called  committee  members.  Then  given  new 
data,  each  committee  member  independently  tags  the 
input,  and  a  given  tag’s  confidence  is  based  on  how 
well  the  committee  members  agree  on  that  tag.  We 
are  currently  defining  the  confidence  of  a  given  tag  to 
be  the  number  of  committee  members  that  preferred 
the  tag.  In  the  future,  we  will  investigate  confidence 
formulas  that  are  based  on  the  entropy  of  the  tags  se¬ 
lected  by  the  different  committee  members. 

We  considered  several  ways  to  develop  the  committee 
members,  and  we  decided  to  apply  the  strategy  that 
Freund  and  Schapire  (1996)  used  for  Boosting:  The 
first  committee  member  is  trained  in  the  standard  way, 
and  then  the  second  committee  member  pays  special 
attention  to  those  instances  in  the  training  data  that 
the  first  committee  member  did  not  tag  correctly.  To 
do  this  in  Transformation-Based  Learning,  we  adjust 
the  improvement  score  formula  to  weight  success  on 
these  “hard”  instances  more  heavily.  (In  effect,  it  is 
as  if  we  were  adding  multiple  copies  of  these  instances 
to  the  training  corpus.)  This  process  can  be  repeated 
to  generate  more  committee  members  by  basing  the 
score  for  correctly  tagging  a  training  instance  on  the 
number  of  previous  committee  members  that  tagged 
that instance incorrectly. We are currently using 2^c
as  the  score  for  correctly  tagging  a  given  instance  that 
c  committee  members  have  mistagged.  This  strategy 
tends  to  produce  committee  members  that  are  very 
different,  as  they  are  focusing  on  different  parts  of  the 
training corpus.
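The reweighting scheme described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `instance_weights` and `weighted_improvement` are hypothetical, and penalizing a newly broken tag by the same weight is an assumption, since the text only specifies the score of 2^c for correctly tagging an instance that c committee members have mistagged.

```python
from typing import List, Sequence


def instance_weights(gold: Sequence[str],
                     committee_tags: List[Sequence[str]]) -> List[int]:
    """Return 2**c for each training instance, where c is the number of
    previously trained committee members that mistagged that instance."""
    weights = []
    for i, correct in enumerate(gold):
        c = sum(1 for tags in committee_tags if tags[i] != correct)
        weights.append(2 ** c)
    return weights


def weighted_improvement(gold: Sequence[str],
                         before: Sequence[str],
                         after: Sequence[str],
                         weights: Sequence[int]) -> int:
    """Weighted improvement score of a candidate rule.

    `before` and `after` are the tags before and after applying the rule.
    Newly corrected instances add their weight; newly broken instances
    subtract it (the subtraction is an assumption of this sketch).
    """
    score = 0
    for g, b, a, w in zip(gold, before, after, weights):
        if a == g and b != g:
            score += w
        elif a != g and b == g:
            score -= w
    return score
```

With weights of this form, an instance missed by two earlier members counts four times as much as one that every member tagged correctly, which is the "multiple copies" effect mentioned in the text.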


Minimum      Percentage of        Average
Confidence   Instances Tagged     Precision

5            45.12% ± 1.28%       90.09% ± 1.51%
4            69.79% ± 1.60%       83.53% ± 1.27%
3            92.38% ± 1.32%       76.57% ± 0.79%
2            99.85% ± 0.20%       73.56% ± 1.10%
1            100.00% ± 0.00%      73.45% ± 1.06%

Figure 6: Testing the committee method on unseen
data, varying the minimum confidence considered


As  a  preliminary  experiment  we  ran  ten  trials  with  five 
committee  members,  testing  on  held-out  data.  Fig¬ 
ure  6  presents  average  scores  and  standard  deviations, 
varying  the  minimum  confidence,  m.  For  a  given  in¬ 
stance,  if  at  least  m  committee  members  agreed  on 
a  tag,  then  the  most  popular  tag  was  applied,  break¬ 
ing  ties  in  favor  of  the  committee  member  that  was 
developed  the  earliest;  otherwise  no  tag  was  output. 
The  results  show  that  the  committee  approach  as¬ 
signs  useful  confidence  measures  to  the  tags:  All  five 
committee  members  agreed  on  the  tags  for  45.12%  of 
the  instances,  and  90.09%  of  those  tags  were  correct. 
Also,  for  69.79%  of  the  instances,  at  least  four  of  the 
five  committee  members  selected  the  same  tag,  and 
this  tag  was  correct  83.53%  of  the  time.  We  foresee 
that  our  module  for  tagging  dialogue  acts  can  poten¬ 
tially  be  integrated  into  a  larger  system  so  that,  when 
Transformation-Based  Learning  cannot  produce  a  tag 
with  high  confidence,  other  modules  may  be  invoked 
to  provide  more  evidence.  In  addition,  like  Boost¬ 
ing,  the  committee  method  improves  the  overall  ac¬ 
curacy  of  the  system.  By  selecting  the  most  popular 
tag  among  all  five  committee  members,  the  average  ac¬ 
curacy  in  tagging  unseen  data  was  73.45%,  while  using 
the first committee member alone resulted in a significantly
(t = 5.42 > 2.88, α = 0.01) lower average score
of  70.79%. 
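The voting rule used in this experiment can be sketched as follows. The function name `committee_tag` is hypothetical, and the sketch assumes that the votes are listed in the order in which the committee members were trained, so that ties can be broken in favor of the earliest member.

```python
from collections import Counter
from typing import List, Optional, Tuple


def committee_tag(votes: List[str], m: int) -> Tuple[Optional[str], int]:
    """Combine the tags chosen by the committee members for one instance.

    votes[k] is the tag selected by the k-th committee member, in training
    order. Returns (tag, confidence), where confidence is the number of
    members that preferred the most popular tag. The tag is None when
    fewer than m members agree.
    """
    counts = Counter(votes)
    best = max(counts.values())
    # Among tags with the top count, take the one that appears first in
    # `votes`, i.e. the one preferred by the earliest committee member.
    tag = next(t for t in votes if counts[t] == best)
    if best < m:
        return None, best
    return tag, best
```

Setting m = 1 always outputs the most popular tag, which corresponds to the bottom row of Figure 6; raising m trades coverage for precision.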

4.4  ALTERNATIVE  METHODS 

Previously,  the  best  success  rate  achieved  on  the  Dia¬ 
logue  Act  Tagging  problem  was  reported  by  Reithinger 
and  Klesen  (1997),  whose  system  used  a  probabilistic 
machine  learning  approach  based  on  N-Grams  to  cor¬ 
rectly  label  74.7%  of  the  utterances  in  a  test  corpus. 
(See  Samuel,  Carberry,  and  Vijay-Shanker  (1998a)  for 
a  more  extensive  analysis  of  previous  work  on  this 
task.)  As  a  direct  comparison,  we  applied  our  system 
to  exactly  the  same  training  and  testing  set.  Over 
five runs, the system achieved an average^10 accuracy
of 75.12%±1.34%, including a high score^11 of 77.44%.

^10 The variation in the scores is due to the random nature
of the Monte Carlo method.

^11 The rules in Figure 2 were produced in this experiment.


In  addition,  we  ran  a  direct  comparison  between 
Transformation-Based  Learning  and  C5.0  (Rulequest 
Research,  1998),  which  is  an  implementation  of  the 
Decision  Trees  method.  The  accuracies  on  held-out 
data  for  training  sets  of  various  sizes  are  presented 
in  Figure  7.  For  Transformation-Based  Learning,  we 
averaged  the  scores  of  ten  trials  for  each  training  set 
(to factor out the random effects of the Monte Carlo
method),  and  the  standard  deviations  are  represented 
by  error  bars  in  the  graph.  These  experiments  did  not 
utilize  the  committee  method,  and  we  would  expect 
the  scores  to  improve  when  this  extension  is  used. 

With  C5.0,  we  wanted  to  use  the  same  features  that 
were  effective  for  Transformation-Based  Learning,  but 
we encountered two problems: 1) Since C5.0 requires
that  each  feature  take  exactly  one  value  for  each  in¬ 
stance,  it  is  very  difficult  to  utilize  the  cue  patterns 
feature.  We  decided  to  provide  one  boolean  feature 
for  each  possible  cue  pattern,  which  was  set  to  True 
for  instances  that  included  that  cue  pattern  and  False 
otherwise. 2) Our Transformation-Based Learning system
utilized the system-generated tag^12 of the preceding
instance. C5.0 cannot use this information, as it
requires  that  the  values  of  all  of  the  features  are  com¬ 
puted  before  training  begins. 
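The first workaround, one boolean feature per possible cue pattern, can be sketched as follows. The function name `booleanize` and the example cue patterns in the usage note are hypothetical; the sketch assumes each instance is given as the set of cue patterns it contains.

```python
from typing import Dict, Iterable, List, Set


def booleanize(instances: Iterable[Set[str]]) -> List[Dict[str, bool]]:
    """Turn a set-valued feature into one boolean feature per value.

    Each instance is the set of cue patterns it contains. The result has,
    for every instance, a True/False entry for every cue pattern that
    occurs anywhere in the data.
    """
    instances = list(instances)
    vocabulary = sorted(set().union(*instances))
    return [{cue: cue in inst for cue in vocabulary} for inst in instances]
```

For example, `booleanize([{"but", "okay"}, {"okay"}])` yields two rows with the columns `but` and `okay`. The number of columns grows with the number of distinct cue patterns, which is one reason the set-valued representation is more convenient.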

The  training  times  of  Transformation-Based  Learning 
and  C5.0  were  relatively  comparable  for  any  number 
of  conditions,  although  Boosting  sometimes  resulted 
in  a  significant  increase  in  training  time.  The  ac¬ 
curacy  scores  of  Transformation-Based  Learning  and 
C5.0,  with  and  without  Boosting,  are  not  significantly 
different,  as  shown  in  Figure  7. 

Figure 7: Training set size vs. tagging accuracy on unseen data

^12 For Transformation-Based Learning, the tags change
as the system applies the rules in the learned model. When
a rule references a tag, it uses the value of the tag at the
point when that rule is processed.

5 DISCUSSION

This paper has described the first investigation of
Transformation-Based Learning applied to discourse-level
problems. We extended the algorithm to address
two limitations of Transformation-Based Learning:
1) We developed a Monte Carlo version of
Transformation-Based Learning, and our experiments
suggest that this improvement dramatically increases
the efficiency of the method without compromising accuracy.
This revision enables Transformation-Based
Learning to work effectively on a wider variety of tasks,
including tasks where the relevant conditions and condition
combinations are not known in advance as well
as tasks where there are a large number of relevant
conditions and condition combinations. This improvement
also decreases the labor demands on the human
developer, who no longer needs to construct a minimal
set of rule templates. It is sufficient to list all of
the  conditions  that  might  be  relevant  and  allow  the 
system  to  consider  all  possible  combinations  of  those 
conditions.  2)  We  devised  a  committee  strategy  for 
computing  confidence  measures  to  represent  the  reli¬ 
ability  of  tags.  In  our  experiments,  this  committee 
method  improved  the  overall  tagging  accuracy  signif¬ 
icantly.  It  also  produced  useful  confidence  measures; 
nearly  half  of  the  tags  were  assigned  high  confidence, 
and  of  these,  90%  were  correct. 

For  the  Dialogue  Act  Tagging  task,  our  modified  ver¬ 
sion  of  Transformation-Based  Learning  has  achieved 
an  accuracy  rate  that  is  comparable  to  any  previously 
reported  system.  In  addition,  Transformation-Based 
Learning  has  a  number  of  features  that  make  it  par¬ 
ticularly  appealing  for  the  Dialogue  Act  Tagging  task: 

1.  Transformation-Based  Learning’s  learned  model 
consists  of  a  relatively  short  sequence  of  intuitive 
rules,  stressing  relevant  features  and  highlight¬ 
ing  important  relationships  between  features  and 
tags  (Brill,  1995a).  Thus,  Transformation-Based 
Learning’s  learned  model  offers  insights  into  a  the¬ 
ory  to  explain  the  training  data.  This  is  especially 
useful  in  Dialogue  Act  Tagging,  which  currently 
lacks  a  systematic  theory. 

2.  With  its  iterative  training  algorithm,  when  devel¬ 
oping  a  new  rule,  Transformation-Based  Learning 
can  consider  tags  that  have  been  produced  by  pre¬ 
vious  rules  (Ramshaw  and  Marcus,  1994).  Since 
the  dialogue  act  of  an  utterance  is  affected  by  the 
surrounding dialogue acts, this leveraged learning
approach can directly integrate the relevant
contextual information into the rules. In addition,
Transformation-Based Learning can accommodate
the focus shifts that frequently occur in
discourse  by  utilizing  features  that  consider  tags 
of  varying  distances. 

3.  Our  Transformation-Based  Learning  system  is 
very  flexible  with  respect  to  the  types  of  features 
it can utilize. For example, it can learn set-valued
features,  such  as  cue  patterns.  Additionally,  be¬ 
cause  of  the  Monte  Carlo  improvement,  our  sys¬ 
tem  can  handle  a  very  large  number  of  features. 

4.  For  the  Dialogue  Act  Tagging  task,  people  still 
don’t  know  what  features  are  relevant,  so  it  is  very 
difficult  to  construct  an  appropriate  set  of  rule 
templates. Fortunately, Transformation-Based
Learning  is  capable  of  discarding  irrelevant  rules, 
as  Ramshaw  and  Marcus  (1994)  showed  exper¬ 
imentally,  so  it  is  not  necessary  that  all  of  the 
given  rule  templates  be  useful. 

5. Ramshaw and Marcus’s (1994) experiments suggest
that Transformation-Based Learning tends to
be resistant to the overfitting^13 problem. This can
be explained by observing how the rule sequence
produced by Transformation-Based Learning progresses
from general rules to specific rules. The
early rules in the sequence are based on many examples
in the training corpus, and so they are
likely to generalize effectively to new data. Later
in the sequence, the rules don’t receive as much
support from the training data, and their applicability
conditions tend to be very specific, so they
have little or no effect on new data. Thus, resistance
to overfitting is an emergent property of the
Transformation-Based Learning algorithm.

^13 Other machine learning algorithms may overfit to the
training data and then have difficulty generalizing to new
data.

For  the  future,  we  intend  to  investigate  a  wider  variety 
of  features  and  explore  different  methods  for  collecting 
cue  patterns  to  increase  our  system’s  accuracy  scores 
further.  Although  we  compared  Transformation- 
Based  Learning  with  a  few  very  different  machine 
learning  algorithms,  we  still  hope  to  examine  other 
methods,  such  as  Naive  Bayes.  In  addition,  we  plan 
to  run  our  experiments  with  different  corpora  to  con¬ 
firm  that  the  encouraging  results  of  our  extensions  to 
Transformation-Based  Learning  can  be  generalized  to 
different  data,  languages,  domains,  and  tasks.  We 
would  also  like  to  extend  our  system  so  that  it  may 
learn  from  untagged  data,  as  there  is  still  very  little 
tagged  data  available  in  discourse.  Brill  developed  an 
unsupervised  version  of  Transformation-Based  Learn¬ 
ing  for  Part-of-Speech  Tagging  (Brill,  1995b),  but  this 
algorithm  must  be  initialized  with  instances  that  can 
be tagged unambiguously (such as “the”, which is always
a determiner), and in Dialogue Act Tagging there
are very few unambiguous examples. We intend to
investigate the following weakly-supervised approach:
First,  the  system  will  be  trained  on  a  small  set  of 
tagged  data  to  produce  a  number  of  different  com¬ 
mittee  members.  Then  given  untagged  data,  it  will 
derive  tags  with  confidence  measures.  Those  tags  that 
receive  very  high  confidence  can  be  used  as  unam¬ 
biguous  examples  to  drive  the  unsupervised  version  of 
Transformation-Based  Learning. 
Abstract


We  study  the  classification  problem  that 
arises  when  two  variables — one  continu¬ 
ous  (x),  one  discrete  (s) — evolve  jointly  in 
time.  We  suppose  that  the  vector  x  traces 
out  a  smooth  multidimensional  curve,  to  each 
point  of  which  the  variable  s  attaches  a  dis¬ 
crete  label.  The  trace  of  s  thus  partitions  the 
curve  into  different  segments  whose  bound¬ 
aries  occur  where  s  changes  value.  We  con¬ 
sider  how  to  learn  the  mapping  between  x 
and  s  from  examples  of  segmented  curves. 

Our  approach  is  to  model  the  conditional  ran¬ 
dom  process  that  generates  segments  of  con¬ 
stant  s  along  the  curve  of  x.  We  suppose 
that the variable s evolves stochastically as
a function of the arc length traversed by x.
Since  arc  length  does  not  depend  on  the  rate 
at  which  a  curve  is  traversed,  this  gives  rise 
to a family of Markov processes whose predictions,
Pr[s|x], are invariant to nonlinear
warpings  (or  reparameterizations)  of  time. 

We  show  how  to  learn  the  parameters  of  these 
Markov  processes  from  labeled  and/or  unla¬ 
beled  examples  of  segmented  curves.  The  re¬ 
sulting  models  are  motivated  for  automatic 
speech recognition, where x are acoustic features
and s are phonetic transcriptions.

1  INTRODUCTION 

The  automatic  segmentation  of  continuous  trajecto¬ 
ries  poses  a  challenging  problem  in  machine  learning. 
The problem arises whenever a multidimensional trajectory
$\{x(t) \mid t \in [0,\tau]\}$ must be described by a sequence
of discrete labels $s_1 s_2 \cdots s_n$. A simple way to
map trajectories into sequences is to specify consecutive
time intervals such that $s(t) = s_k$ for $t \in [t_{k-1}, t_k]$.
This attaches the labels $s_k$ to contiguous arcs along the
trajectory.  The  learning  problem  is  to  discover  such  a 
mapping  from  labeled  and/or  unlabeled  examples. 

In  this  paper,  we  study  this  problem,  paying  special 
attention  to  the  fact  that  curves  have  intrinsic  geomet¬ 
ric  properties  that  do  not  depend  on  the  rate  at  which 
they  are  traversed  (do  Carmo,  1976).  Such  properties 
include,  for  example,  the  total  arc  length  and  the  max¬ 
imum  distance  between  any  two  points  on  the  curve. 
Given a multidimensional trajectory $\{x(t) \mid t \in [0,\tau]\}$,
these properties are invariant to reparameterizations
$t \to f(t)$, where $f(t)$ is any monotonic function that
maps the interval $[0, \tau]$ into itself. Put another way,
the  intrinsic  geometric  properties  of  the  curve  are  in¬ 
variant  to  nonlinear  warpings  of  time. 

Invariance  to  nonlinear  warpings  of  time  is  an  example 
of  a  mathematical  symmetry.  The  importance  of  such 
symmetries  in  statistical  pattern  recognition  (Duda  & 
Hart,  1973)  is  well-known.  For  example,  in  the  prob¬ 
lem  of  object  recognition  from  two  dimensional  images, 
one often incorporates invariances to translations, rotations,
and changes of scale (Simard et al., 1993). In
the  segmentation  of  continuous  trajectories,  one  natu¬ 
rally  encounters  the  question  of  invariance  to  nonlinear 
warpings  of  time.  A  better  understanding  of  this  in¬ 
variance  is  therefore  valuable  in  its  own  right.  Beyond 
its  mathematical  interest,  however,  the  principled  han¬ 
dling  of  this  invariance  suggests  new  algorithms  for  the 
automatic  segmentation  of  continuous  trajectories.  In¬ 
deed,  the  primary  motivation  for  this  work  is  its  po¬ 
tential  application  to  automatic  speech  recognition — a 
subject  to  which  we  return  in  the  final  section  of  the 
paper. 

The  study  of  curves  requires  some  simple  notions  from 
differential geometry. As a matter of terminology, we
refer to particular parameterizations of curves as trajectories.
We regard two trajectories $x_1(t)$ and $x_2(t)$ as
equivalent to the same curve if there exists a monotonically
increasing function $f$ for which $x_1(t) = x_2(f(t))$.
(To be precise, we mean the same oriented curve: the
direction of traversal matters.) Here, as in what follows,
we adopt the convention of using x(t) to denote
an entire trajectory as opposed to constantly writing
out $\{x(t) \mid t \in [0,\tau]\}$. When necessary to refer to the
value of x(t) at a particular moment in time, we use a
different index, such as $x(t_1)$.

Let us return now to the problem of automatic segmentation.
Consider two variables, one continuous (x) and
one discrete (s), that evolve jointly in time. Thus the
vector x traces out a smooth multidimensional curve,
to  each  point  of  which  the  variable  s  attaches  a  discrete 
label.  Note  that  each  trace  of  s  yields  a  partition  of 
the  curve  into  different  components;  in  particular,  the 
boundaries  of  these  components  occur  at  the  points 
where  s  changes  value.  We  refer  to  such  partitions 
as  segmentations  and  to  the  regions  of  constant  s  as 
segments; see figure 1.

Our goal in this paper is to learn a probabilistic mapping
between trajectories x(t) and segmentations s(t)
from labeled and/or unlabeled examples. Consider the
conditional random process that generates segments
of constant s along the curve traced out by x. Given
a trajectory x(t), let Pr[s(t) | x(t)] denote the conditional
probability distribution over possible segmentations.
Suppose that for any two equivalent trajectories
x(t) and x(f(t)), we have the identity:

Pr[s(t) | x(t)] = Pr[s(f(t)) | x(f(t))]. (1)

Eq.  (1)  captures  a  fundamental  invariance — namely, 
that  the  probability  that  the  curve  is  segmented  in 
a  particular  way  is  independent  of  the  rate  at  which 
it  is  traversed.  In  this  paper,  we  study  Markov  pro¬ 
cesses  with  this  property.  We  call  them  Markov  pro¬ 
cesses  on  curves  (MPCs)  because  for  these  processes 
it  is  unambiguous  to  write  Pr[s  |  x]  without  provid¬ 
ing  explicit  parameterizations  for  the  trajectories,  x(t) 
or  s(t).  The  distinguishing  feature  of  MPCs  is  that  the 
variable s evolves as a function of the arc length traversed
along x, a quantity that is manifestly invariant
to  nonlinear  warpings  of  time. 

The main contributions of this paper are: (i) to postulate
eq. (1) as a fundamental invariance of random
processes; (ii) to introduce MPCs as a family of probabilistic
models that capture this invariance; (iii) to
derive monotonically convergent learning procedures
for MPCs based on the principle of maximum likelihood
estimation; and (iv) to contrast the properties
of MPCs with those of hidden Markov models
(HMMs), especially as they relate to problems in automatic
speech recognition (Rabiner & Juang, 1993).
In terms of previous work, our motivation most closely
resembles that of Tishby (1990), who several years ago
proposed a dynamical system approach to speech processing.

Figure 1: Two variables, one continuous (x) and one
discrete (s), evolve jointly in time. The trace of s partitions
the curve of x into different segments whose
boundaries occur where s changes value. Markov processes
on curves model the conditional distribution,
Pr[s|x].

The  organization  of  this  paper  is  as  follows.  In  sec¬ 
tion  2,  we  begin  by  reviewing  some  basic  concepts 
from  differential  geometry.  We  then  introduce  MPCs 
as  a  family  of  continuous-time  Markov  processes  that 
parameterize  the  conditional  probability  distribution, 
Pr[s | x]. The processes are derived from a set of differential
equations that describe the pointwise evolution
of  s  along  the  curve  traced  out  by  x. 

In  section  3,  we  consider  how  to  learn  the  parameters 
of  MPCs  in  both  supervised  and  unsupervised  settings. 
These  settings  correspond  to  whether  the  learner  has 
access to labeled or unlabeled examples. Labeled examples
consist of trajectories x(t), along with their
corresponding segmentations:

$\{\text{START} \to (s_1, t_1) \cdots (s_n, t_n) \to \text{END}\}$. (2)

The ordered pairs in eq. (2) indicate that s(t) takes
the value $s_k$ between times $t_{k-1}$ and $t_k$; the START
and END states are used to mark endpoints. Unlabeled
examples consist only of the trajectories x(t) and the
boundary values:

$\{(0, \text{START}) \to (\tau, \text{END})\}$. (3)

Eq. (3) specifies only that the Markov process starts
at time t = 0 and terminates at some later time $\tau$. In
this case, the learner must infer its own target values
for s(t) in order to update its parameter estimates. We
view both types of learning as instances of maximum
likelihood estimation and describe an Expectation-Maximization
(EM) algorithm for the more general
case of unlabeled (or partially labeled) examples.

In section 4, we discuss the application of MPCs to automatic
speech recognition (Rabiner & Juang, 1993).
Here we can identify the curves x with time-varying
spectral signatures and the segmentations s with phonetic
transcriptions. We discuss possible advantages of
MPCs over hidden Markov models, the current leading
technology for automatic speech recognition. The
most important of these are: (i) the natural handling
of variations in speaking rate (i.e., the rate at
which acoustic features, summarized by x, change
with time) and (ii) the emphasis on learning a recognition
model Pr[s|x], as opposed to a synthesis model
Pr[x|s]. Finally, we conclude by outlining our plans
for  future  work. 

2  MARKOV  PROCESSES  ON 
CURVES 

Markov  processes  on  curves  are  based  fundamentally 
on  the  notion  of  arc  length.  After  reviewing  how  to 
compute  arc  lengths  along  curves,  we  show  how  they 
can  be  used  to  define  random  processes  that  capture 
the  invariance  of  eq.  (1). 

2.1  ARC  LENGTH 

Let g(x) define a D × D matrix-valued function over
$x \in \mathcal{R}^D$. If g(x) is everywhere non-negative definite,
then we can use it as a metric to compute distances
along curves. In particular, consider two nearby points
separated by the infinitesimal vector dx. We define the
squared distance between these two points as:

$d\ell^2 = dx^\top g(x)\, dx.$ (4)

Arc length along a curve is the non-decreasing function
computed by integrating these local distances. Thus,
for the trajectory x(t), the arc length between the
points $x(t_1)$ and $x(t_2)$ is given by:

$\ell = \int_{t_1}^{t_2} dt \, \left[ \dot{x}^\top g(x)\, \dot{x} \right]^{1/2},$ (5)

where $\dot{x} = \frac{d}{dt}[x(t)]$ denotes the time derivative of x.
Note that the arc length between two points is invariant
under reparameterizations of the trajectory,
$x(t) \to x(f(t))$, where $f(t)$ is any smooth monotonic
function of time that maps the interval $[t_1, t_2]$ into itself.
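The invariance of arc length under reparameterization can be checked numerically. The following is a minimal sketch, not part of the original method: the curve, the warping f(t) = t^2, the position-dependent metric, and the names `arc_length`, `curve`, `warped`, and `metric` are all illustrative assumptions. The integral in eq. (5) is approximated by summing the lengths of short chords, each measured with the metric evaluated at the chord midpoint.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def arc_length(x: Callable[[float], Vector],
               g: Callable[[Vector], Matrix],
               t1: float, t2: float, steps: int = 20000) -> float:
    """Approximate eq. (5) by summing sqrt(dx^T g(x) dx) over small steps."""
    total = 0.0
    prev = x(t1)
    for k in range(1, steps + 1):
        cur = x(t1 + (t2 - t1) * k / steps)
        dx = [c - p for c, p in zip(cur, prev)]
        mid = [(c + p) / 2.0 for c, p in zip(cur, prev)]
        G = g(mid)
        dim = len(dx)
        squared = sum(dx[u] * G[u][v] * dx[v]
                      for u in range(dim) for v in range(dim))
        total += math.sqrt(squared)
        prev = cur
    return total


def curve(t: float) -> Vector:
    """An illustrative trajectory in the plane."""
    return (t, t * t)


def warped(t: float) -> Vector:
    """The same curve traversed at a different rate, f(t) = t**2 on [0, 1]."""
    return curve(t * t)


def metric(x: Vector) -> Matrix:
    """An illustrative position-dependent, positive definite metric."""
    return [[1.0 + x[0] ** 2, 0.0], [0.0, 1.0]]


length_original = arc_length(curve, metric, 0.0, 1.0)
length_warped = arc_length(warped, metric, 0.0, 1.0)
print(length_original, length_warped)
```

The two printed values agree up to discretization error, because both trajectories trace the same set of points in the same order and the metric depends only on position.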


In the special case where g(x) is the identity matrix,
eq. (5) reduces to the standard definition of arc
length in Euclidean space. More generally, however,
eq. (4) defines a non-Euclidean metric for computing
arc lengths. Thus, for example, if the metric g(x)
varies as a function of x, then eq. (5) can assign different
arc lengths to the trajectories x(t) and $x(t) + x_0$,
where $x_0$ is a constant displacement.

2.2  STATES  AND  LIFETIMES 

The  problem  of  segmentation  is  to  map  a  trajectory 
x(t) into a sequence of discrete labels $s_1 s_2 \ldots s_n$. If
these labels are attached to contiguous arcs along the
curve of x, then we can describe this sequence by a
piecewise constant function of time, s(t), as in figure 1.
We  refer  to  the  possible  values  of  s  as  states.  In  what 
follows,  we  introduce  a  family  of  conditional  random 
processes  that  evolve  s  as  a  function  of  the  arc  length 
traversed  along  the  curve  traced  out  by  x.  These  ran¬ 
dom  processes  are  based  on  a  simple  premise — namely, 
that  the  probability  of  remaining  in  a  particular  state 
decays  exponentially  with  the  cumulative  arc  length 
traversed  in  that  state.  The  signature  of  a  state  is 
the  particular  way  in  which  it  computes  arc  length. 

To  formalize  this  idea,  we  associate  with  each  state  i 
the following quantities: (i) a position-dependent matrix
$g_i(x)$ that can be used to compute arc lengths, as
in eq. (5); (ii) a decay parameter $\lambda_i$ that measures the
probability per unit arc length that s makes a transition
from state i to some other state; and (iii) a set
of transition probabilities $a_{ij}$, where $a_{ij}$ represents the
probability that, having decayed out of state i, the
variable s makes a transition to state j. Thus, $a_{ij}$ defines
a stochastic transition matrix with zero elements
along the diagonal and rows that sum to one: $a_{ii} = 0$
and $\sum_j a_{ij} = 1$.

Together, these quantities can be used to define a Markov
process along the curve traced out by x. In particular,
let $p_i(t)$ denote the probability that s is in state i at
time t, based on its history up to that point in time.
A Markov process is defined by the set of differential
equations:

$\frac{dp_i}{dt} = -\lambda_i p_i \left[ \dot{x}^\top g_i(x)\, \dot{x} \right]^{1/2} + \sum_{j \neq i} \lambda_j p_j a_{ji} \left[ \dot{x}^\top g_j(x)\, \dot{x} \right]^{1/2}.$ (6)

The  right  hand  side  of  eq.  (6)  consists  of  two  compet¬ 
ing  terms.  The  first  term  computes  the  probability 
that s decays out of state i; the second computes the
probability that s decays into state i. Both probabilities
are proportional to measures of arc length, and
combining them gives the overall change in probability
that occurs in the time interval [t, t + dt]. The process
is Markovian because the evolution of $p_i$ depends only
on quantities available at time t; thus the future is
independent  of  the  past  given  the  present. 
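The competition between the two terms can be illustrated with a forward-Euler step of these dynamics. This is a sketch under assumptions: the function name `euler_step` is hypothetical, the argument `speed[i]` stands for the quantity $[\dot{x}^\top g_i(x)\, \dot{x}]^{1/2}$ evaluated at the current time, and forward Euler is chosen only for simplicity.

```python
from typing import List, Sequence


def euler_step(p: Sequence[float],
               speed: Sequence[float],
               lam: Sequence[float],
               a: Sequence[Sequence[float]],
               dt: float) -> List[float]:
    """Advance the state probabilities p by one small time step dt.

    speed[i] is the rate at which arc length accumulates under the metric
    of state i; lam[i] is the decay parameter of state i; a[j][i] is the
    probability of moving to state i after decaying out of state j.
    """
    n = len(p)
    # Probability per unit time of decaying out of each state.
    outflow = [lam[i] * p[i] * speed[i] for i in range(n)]
    new_p = []
    for i in range(n):
        inflow = sum(outflow[j] * a[j][i] for j in range(n) if j != i)
        new_p.append(p[i] + dt * (inflow - outflow[i]))
    return new_p
```

Because every unit of probability that decays out of one state is redistributed according to a row of the transition matrix, the total probability is unchanged by each step, provided the rows of `a` sum to one.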

Eq. (6) has certain properties of interest. First, note
that summing both sides over i gives the identity
$\sum_i dp_i/dt = 0$. This shows that $p_i$ remains a normalized
probability distribution: i.e., $\sum_i p_i = 1$ at all
times. Second, suppose that we start in state i and
do not allow return visits: i.e., $p_i = 1$ and $a_{ji} = 0$ for
all j. In this case, the second term of eq. (6) vanishes,
and we obtain a simple, one-dimensional linear differential
equation for $p_i(t)$. It follows that the probability
of remaining in state i decays exponentially with the
amount of arc length traversed by x, where arc length
is computed using the matrix $g_i(x)$. The decay parameter,
$\lambda_i$, controls the typical amount of arc length
traversed in state i; it may be viewed as an inverse
lifetime or, to be more precise, an inverse lifelength.
Finally, noting that arc length is a reparameterization-invariant
quantity, we therefore observe that these dynamics
capture the fundamental invariance of eq. (1).
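The exponential decay in the second observation can be written out as a short worked derivation. The symbol $\ell_i(t)$, the cumulative arc length measured with $g_i$ from time 0 to time t, is introduced here only for illustration.

```latex
% No return visits: the second term of eq. (6) vanishes.
\frac{dp_i}{dt} = -\lambda_i \, p_i \left[ \dot{x}^\top g_i(x)\, \dot{x} \right]^{1/2}

% Define the cumulative arc length in state i, as in eq. (5):
\ell_i(t) = \int_0^t dt' \left[ \dot{x}^\top g_i(x)\, \dot{x} \right]^{1/2}
\quad\Longrightarrow\quad
\frac{d\ell_i}{dt} = \left[ \dot{x}^\top g_i(x)\, \dot{x} \right]^{1/2}

% Divide by p_i and integrate from 0 to t, using p_i(0) = 1:
\frac{d}{dt} \ln p_i = -\lambda_i \frac{d\ell_i}{dt}
\quad\Longrightarrow\quad
p_i(t) = e^{-\lambda_i \ell_i(t)}

% Mean arc length traversed before leaving state i:
\int_0^\infty \ell \, \lambda_i e^{-\lambda_i \ell} \, d\ell = \frac{1}{\lambda_i}
```

The last line is the sense in which $\lambda_i$ acts as an inverse lifelength: the expected arc length traversed in state i is $1/\lambda_i$, regardless of how quickly the curve is traversed in time.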

2.3  INFERENCE 


Let $a_{0i}$ denote the probability that the variable s
makes an immediate transition from the START state
(denoted by the zero index) to state i; put another way,
this is the probability that the first segment belongs to
state i. Given a trajectory x(t), the Markov process
in eq. (6) gives rise to a conditional probability distribution
over possible segmentations, s(t). Consider the
segmentation in which s(t) takes the value $s_k$ between
times $t_{k-1}$ and $t_k$, and let

$\ell_{s_k} = \int_{t_{k-1}}^{t_k} dt \, \left[ \dot{x}^\top g_{s_k}(x)\, \dot{x} \right]^{1/2}$ (7)

denote the arc length traversed in state $s_k$. From
eq. (6), we know that the probability of remaining in
a particular state decays exponentially with this arc
length. Thus, the conditional probability of this segmentation
is given by:

$\Pr[s|x] = \prod_{k=1}^{n} \lambda_{s_k} e^{-\lambda_{s_k} \ell_{s_k}} \prod_{k=0}^{n} a_{s_k s_{k+1}},$ (8)

where $s_0$ and $s_{n+1}$ denote the START and END states.
In eq. (8), the second product multiplies the probabilities
for transitions between states $s_k$ and $s_{k+1}$. The leading
factors of $\lambda_{s_k}$ are included to normalize each state’s
duration model.

There are many important quantities that can be computed
from the distribution, Pr[s|x]. Of particular interest
is the most probable segmentation:

$s^* = \operatorname{argmax}_{s} \left\{ \ln \Pr[s|x] \right\}.$ (9)

Given a particular trajectory x(t), eq. (9) calls for a
maximization over all piecewise constant functions of
time, s(t). In practice, this maximization can be performed
by discretizing the time axis and applying a
dynamic programming procedure. The resulting segmentations
will be optimal at some finite temporal
resolution, $\Delta t$. For example, let $\alpha_i(t)$ denote the
log-likelihood of the most probable segmentation, ending
in state i, of the subtrajectory up to time t. Starting
from the initial condition $\alpha_i(0) = \ln[a_{0i}]$, we compute

$\alpha_j(t + \Delta t) = \max_i \left\{ \alpha_i(t) + \ln[\lambda_i a_{ij}] (1 - \delta_{ij}) \right\} - \lambda_j \Delta t \left[ \dot{x}^\top g_j(x)\, \dot{x} \right]^{1/2},$ (10)

where $\delta_{ij}$ is the discrete delta function. Also, at each
time step, let $\Psi_j(t + \Delta t)$ record the value of i that maximizes
the right hand side of eq. (10). Suppose that
the Markov process terminates at time $\tau$. Enforcing
the endpoint condition $s^*(\tau) = \text{END}$, we find the most
likely segmentation by back-tracking:

$s^*(t - \Delta t) = \Psi_{s^*(t)}(t).$ (11)

These  recursions  yield  a  segmentation  that  is  opti¬ 
mal  at  some  finite  temporal  resolution  At.  Gener¬ 
ally  speaking,  by  choosing  At  to  be  sufficiently  small, 
one  can  minimize  the  errors  introduced  by  discretiza¬ 
tion.  In  practice,  one  would  choose  At  to  reflect  the 
time  scale  beyond  which  it  is  not  necessary  to  consider 
changes  of  state. 

Other  types  of  inferences  can  also  be  made  from  the 
distribution,  eq.  (8).  For  example,  one  can  compute 
the  marginal  probability  that  the  Markov  process  ter¬ 
minates  at  precisely  the  observed  time.  This  is  done 
by  summing  the  probabilities 


where  we  have  used  sq  and  s„+i  to  denote  the  START 
and  END  states  of  the  Markov  process.  The  first  prod¬ 
uct  in  eq.  (8)  multiplies  the  probabilities  that  each 
segment  traverses  exactly  its  observed  arc  length.  The 


Pr  [s(r)  =  END  1  x{t)]  = 
»(0 
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where  the  zero-one  weighting  factor  selects  out  only 
those  segmentations  that  terminate  precisely  at  time  r. 
Similarly,  one  can  compute  the  posterior  probability, 
Pr[s(ti)  =  s(r)  =  end],  that  at  an  earlier  mo¬ 

ment  in  time,  ti,  the  variable  s  was  in  state  i.  Both 
types  of  inferences  are  handled  by  discretizing  the  time 
axis  and  applying  a  dynamic  programming  procedure 
similar  to  eqs.  (10-11).  In  the  interest  of  brevity,  we  do 
not  give  the  details  of  these  constructions,  noting  only 
that  in  most  respects  they  are  completely  analogous 
to  the  ones  for  discrete-time  hidden  Markov  models 
(Rabiner  k  Juang,  1993). 
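The dynamic programming procedure of eqs. (10-11) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes that a change of state from $i$ to $j$ adds $\ln[\lambda_j a_{ij}]$, that each time step subtracts $\lambda_j$ times the arc-length increment measured in state $j$, and that the increments have been precomputed in a hypothetical array `arc_inc`. The function name and argument layout are likewise hypothetical.

```python
import numpy as np

def most_probable_segmentation(arc_inc, lam, a, a_start, a_end):
    """Viterbi-style search for the most probable segmentation.

    arc_inc[t, j]: arc-length increment of time step t under state j's metric
    lam[j]:        decay parameter of state j
    a[i, j]:       transition probability from state i to state j (i != j)
    a_start[j]:    probability that the first segment belongs to state j
    a_end[j]:      probability of a transition from state j to END
    """
    arc_inc = np.asarray(arc_inc, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n_steps, n_states = arc_inc.shape
    with np.errstate(divide="ignore"):      # log(0) = -inf marks forbidden moves
        switch = np.log(np.asarray(a, dtype=float)) + np.log(lam)[None, :]
        log_start = np.log(np.asarray(a_start, dtype=float))
        log_end = np.log(np.asarray(a_end, dtype=float))
    np.fill_diagonal(switch, 0.0)           # staying in the same state adds nothing
    alpha = log_start - lam * arc_inc[0]
    back = np.zeros((n_steps, n_states), dtype=int)
    for t in range(1, n_steps):
        scores = alpha[:, None] + switch    # scores[i, j]: arrive in j from i
        back[t] = np.argmax(scores, axis=0)
        alpha = scores.max(axis=0) - lam * arc_inc[t]
    state = int(np.argmax(alpha + log_end)) # endpoint condition s*(tau) = END
    path = [state]
    for t in range(n_steps - 1, 0, -1):     # back-tracking, as in eq. (11)
        state = int(back[t, state])
        path.append(state)
    return path[::-1]
```

The returned list gives one state index per time step; its run boundaries are the inferred segment boundaries at resolution $\Delta t$.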

3  LEARNING  FROM  EXAMPLES 

In this section, we consider how to learn Markov processes of the form, eq. (6). By learning, we mean how to estimate the parameters $\{\lambda_i, a_{ij}, g_i(x)\}$ from examples of segmented (or non-segmented) curves. Our first step is to assume a convenient parameterization for the matrices, $g_i(x)$, that compute arc lengths. We then show how to fit these matrices, along with the parameters $\lambda_i$ and $a_{ij}$, by maximum likelihood estimation.

A variety of parameterizations can be considered for the matrices, $g_i(x)$. In this paper, we consider the very simple form:

$$g_i(x) = \left[(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]^2 \sigma_i^{-1}, \qquad (13)$$

where the parameters $\mu_i$, $\Sigma_i$ and $\sigma_i$ are set by maximum likelihood estimation. Here, $\Sigma_i$ and $\sigma_i$ are positive-definite $D \times D$ square matrices, while $\mu_i$ is a $D$-dimensional vector. We also impose the determinant constraint $|\Sigma_i||\sigma_i|^{1/2} = 1$; this eliminates the degenerate solution, $g_i(x) = 0$, in which every trajectory is assigned zero arc length. Note that there remains an artificial degree of freedom associated with simultaneously rescaling $\Sigma_i$ and $\sigma_i$.

The form of eq. (13) is designed to endow each state with a characteristic signature. In particular, consider the differential arc lengths that appear in eq. (6):

$$\left[\dot{x}^T g_i(x)\, \dot{x}\right]^{1/2} = \left[(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right] \left[\dot{x}^T \sigma_i^{-1} \dot{x}\right]^{1/2}.$$

If $x$ is close to $\mu_i$, then both the arc length and the corresponding probability of decay (out of state $i$) are small. Each state is therefore characterized by the values of $x$ that allow it to persist. Intuitively, the parameters $\mu_i$ can be viewed as target vectors associated with each state of the Markov process. Typical deviations about $\mu_i$ are encoded by $\Sigma_i$ and $\sigma_i$. In what follows, we show how to learn the parameters that best characterize each state.
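To make the role of the target vector concrete, the sketch below measures the same small loop with one state's metric, once centered on $\mu_i$ and once displaced from it. It is an illustration only and assumes the factorized form of the differential arc length used in the appendix (a quadratic form in $x - \mu_i$ multiplied by $[\dot{x}^T \sigma_i^{-1} \dot{x}]^{1/2}$). The function name, the finite-difference tangent, and the test trajectories are hypothetical.

```python
import numpy as np

def state_arc_length(x, dt, mu, Sigma, sigma):
    """Arc length of a sampled trajectory x[t] under one state's metric."""
    xdot = np.gradient(x, dt, axis=0)                     # finite-difference tangent vectors
    diff = x - mu
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu) at every sample
    phi = np.einsum("td,de,te->t", diff, np.linalg.inv(Sigma), diff)
    # sqrt(xdot^T sigma^{-1} xdot) at every sample
    speed = np.sqrt(np.einsum("td,de,te->t", xdot, np.linalg.inv(sigma), xdot))
    return float(np.sum(phi * speed) * dt)

t = np.linspace(0.0, 2.0 * np.pi, 400)
dt = t[1] - t[0]
loop = 0.1 * np.stack([np.cos(t), np.sin(t)], axis=1)     # small loop around the origin
mu = np.zeros(2)                                          # hypothetical target vector
near = state_arc_length(loop, dt, mu, np.eye(2), np.eye(2))
far = state_arc_length(loop + 3.0, dt, mu, np.eye(2), np.eye(2))
```

The loop near the target accumulates far less arc length than the displaced copy, so the state is much more likely to persist along the former.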


3.1  LABELED  EXAMPLES 


Suppose we are given examples of segmented trajectories, $\{x_\alpha(t), s_\alpha(t)\}$, where the index $\alpha$ runs over the examples in the training set. As shorthand, let $\delta_{i\alpha}(t)$ denote the indicator function that selects out segments associated with state $i$:

$$\delta_{i\alpha}(t) = \begin{cases} 1 & \text{if } s_\alpha(t) = i, \\ 0 & \text{otherwise.} \end{cases} \qquad (14)$$

Also, let $\ell_{i\alpha}$ denote the total arc length traversed by state $i$ in the $\alpha$th example:

$$\ell_{i\alpha} = \int dt\, \delta_{i\alpha}(t) \left[\dot{x}_\alpha^T g_i(x_\alpha)\, \dot{x}_\alpha\right]^{1/2}. \qquad (15)$$

In this paper we view learning as a problem in maximum likelihood estimation. Thus we seek the parameters that maximize the conditional log-likelihood:

$$\sum_\alpha \ln \Pr[s_\alpha | x_\alpha] = -\sum_{i\alpha} \lambda_i \ell_{i\alpha} + \sum_{ij} n_{ij} \ln[\lambda_i a_{ij}], \qquad (16)$$

where $n_{ij}$ is the overall number of observed transitions from state $i$ to state $j$. The first term in eq. (16) measures the log-likelihood of observed segments in isolation, while the second measures the log-likelihood of observed transitions.


Eq. (16) has a convenient form for maximum likelihood estimation. In particular, there are closed-form solutions for the values of $\lambda_i$ and $a_{ij}$ that maximize this log-likelihood; they are given by:

$$a_{ij} = n_{ij}/n_i, \qquad (17)$$

$$\lambda_i^{-1} = \frac{1}{n_i} \sum_\alpha \ell_{i\alpha}, \qquad (18)$$

where $n_i = \sum_j n_{ij}$. In general, we cannot find closed-form solutions for the maximum-likelihood estimates of $\{\mu_i, \Sigma_i, \sigma_i\}$. However, we can update these parameters in an iterative fashion that is guaranteed to increase the log-likelihood at each step. Denoting the updated parameters by $\{\tilde{\mu}_i, \tilde{\Sigma}_i, \tilde{\sigma}_i\}$, we consider the iterative scheme (derived in the appendix):

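$$\tilde{\mu}_i = \frac{\sum_\alpha \int dt\, \delta_{i\alpha} \left[\dot{x}_\alpha^T \sigma_i^{-1} \dot{x}_\alpha\right]^{1/2} x_\alpha}{\sum_\alpha \int dt\, \delta_{i\alpha} \left[\dot{x}_\alpha^T \sigma_i^{-1} \dot{x}_\alpha\right]^{1/2}}, \qquad (19)$$

$$\tilde{\Sigma}_i = \frac{\sum_\alpha \int dt\, \delta_{i\alpha} \left[\dot{x}_\alpha^T \sigma_i^{-1} \dot{x}_\alpha\right]^{1/2} \Delta_{i\alpha} \Delta_{i\alpha}^T}{\sum_\alpha \int dt\, \delta_{i\alpha} \left[\dot{x}_\alpha^T \sigma_i^{-1} \dot{x}_\alpha\right]^{1/2}}, \qquad (20)$$

$$\tilde{\sigma}_i = c_i \sum_\alpha \int dt\, \delta_{i\alpha}\, \frac{\dot{x}_\alpha \dot{x}_\alpha^T}{\left[\dot{x}_\alpha^T \sigma_i^{-1} \dot{x}_\alpha\right]^{1/2}} \left[\Delta_{i\alpha}^T \tilde{\Sigma}_i^{-1} \Delta_{i\alpha}\right], \qquad (21)$$

where the constant $c_i$ is determined by the determinant constraint $|\tilde{\Sigma}_i||\tilde{\sigma}_i|^{1/2} = 1$ and we have introduced the shorthand notation,

$$\Delta_{i\alpha}(t) = x_\alpha(t) - \tilde{\mu}_i, \qquad (22)$$

for the difference between $x_\alpha(t)$ and its (re-estimated) target value in state $i$. Note that all the variables in eqs. (19-21) with the subscript $\alpha$ have an implicit time dependence.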

Some intuition for the form of these updates can be gained by considering the points distributed along $x_\alpha(t)$, as weighted by the measure $\delta_{i\alpha}(t)[\dot{x}_\alpha^T \sigma_i^{-1} \dot{x}_\alpha]^{1/2}$. The updates for $\tilde{\mu}_i$ and $\tilde{\Sigma}_i$ simply compute the mean and covariance of this distribution. The update for $\tilde{\sigma}_i$ has a similar interpretation, though its derivation relies on the introduction of an auxiliary function, $Q(\tilde{\sigma}_i, \sigma_i)$, as in the Expectation-Maximization (EM) procedure (Dempster, Laird, & Rubin, 1977). Note that it is important to perform the updates in the order shown, since (for example) the $\Sigma$-update depends on the re-estimated value of $\mu$. By taking gradients of eq. (16), one can show that the fixed points of this iterative procedure correspond to stationary points of the log-likelihood. A proof sketch of monotonic convergence is given in the appendix.

In the case of labeled examples, the above procedures for maximum likelihood estimation can be invoked independently for each state $i$. One first iterates eqs. (19-21) to estimate the parameters that determine $g_i(x)$. These parameters are then used to compute the arc lengths, $\ell_{i\alpha}$, that appear in eq. (15). Given these arc lengths, the decay parameters and transition probabilities follow directly from eqs. (17-18). Thus the problem of learning given labeled examples is relatively straightforward.
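The final step of this procedure, the closed-form estimates of eqs. (17-18), can be sketched in a few lines. The sketch assumes that the transition counts and per-example arc lengths have already been computed; the function name and array layout are hypothetical, and the decay estimate is written as $\lambda_i = n_i / \sum_\alpha \ell_{i\alpha}$, which is what setting the derivative of eq. (16) with respect to $\lambda_i$ to zero gives.

```python
import numpy as np

def estimate_decay_and_transitions(n, arc):
    """Closed-form maximum likelihood estimates from labeled examples.

    n[i, j]:     number of observed transitions from state i to state j
                 (columns may include the END state)
    arc[i, k]:   arc length traversed by state i in the k-th training example
    """
    n = np.asarray(n, dtype=float)
    arc = np.asarray(arc, dtype=float)
    n_i = n.sum(axis=1)              # total transitions out of each state
    a = n / n_i[:, None]             # transition probabilities, rows sum to one
    lam = n_i / arc.sum(axis=1)      # decay parameters (inverse lifelengths)
    return lam, a

# Illustrative counts: two states, third column counts transitions to END.
lam, a = estimate_decay_and_transitions(
    n=[[0, 2, 2], [1, 0, 1]],
    arc=[[1.0, 3.0], [0.5, 0.5]],
)
```

A state that accumulates more arc length per observed exit receives a smaller decay parameter, consistent with its interpretation as an inverse lifelength.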

3.2  UNLABELED  EXAMPLES 

In this section we consider the problem of unsupervised learning. In this setting, the learner does not have access to labeled examples; the only available information consists of the trajectories $x_\alpha(t)$, as well as the fact that each process terminates at some time $\tau_\alpha$. The goal of unsupervised learning is to maximize the conditional log-likelihood,

$$\sum_\alpha \ln \Pr[s_\alpha(\tau_\alpha) = \text{END} \mid x_\alpha(t)], \qquad (23)$$

that for each trajectory $x_\alpha(t)$, some probable segmentation can be found that terminates at precisely the observed time. The marginal probabilities in eq. (23) are computed by summing $\Pr[s(t)|x(t)]$ over allowed segmentations, as in eq. (12).

The maximization of this log-likelihood defines a problem in hidden variable density estimation. The hidden variables are the states of the Markov process. If these variables were known, the problem would reduce to the one considered in the previous section. To fill in these missing values, we avail ourselves of the Expectation-Maximization (EM) algorithm (Baum, 1972; Dempster, Laird, & Rubin, 1977). Roughly speaking, the EM algorithm works by converting the maximization of eq. (23) into a weighted version of the problem where the segmentations, $s_\alpha(t)$, are known. The weights are determined by the posterior probabilities, $\Pr[s_\alpha(t) \mid x_\alpha(t), s_\alpha(\tau_\alpha) = \text{END}]$, derived from the current parameter estimates.

In the interest of brevity, we do not give a detailed account of the full EM algorithm for MPCs. We note, however, that eqs. (10-11) by themselves suffice to implement a very good approximation to the full procedure. This approximation is to compute, based on the current parameter estimates, the optimal segmentation, $s^*_\alpha(t)$, for each trajectory in the training set; one then re-estimates the parameters of the Markov process by treating the inferred segmentations, $s^*_\alpha(t)$, as targets. This approximation reduces the problem of parameter estimation to the one considered in the previous section. It can be viewed as a winner-take-all approximation to the full EM algorithm, analogous to the Viterbi approximation for hidden Markov models (Rabiner & Juang, 1993).

Essentially the same algorithm can also be applied to the intermediate case of partially labeled examples. Suppose, for example, that the learner has access to labeled state sequences but not to segmented curves; in other words, examples are provided in the form:

$$\{\text{START}\;\; (s_1, ?)\; \cdots\; (s_n, ?)\;\; \text{END}\}. \qquad (24)$$

The ability to handle such examples is important for two reasons: first, because they provide significantly more information than unlabeled examples, and second, because they are often much cheaper to generate than fully segmented curves. As before, we can view the learning problem for these examples as one in hidden variable density estimation. In this case, the hidden variables are not the states of the Markov process per se, but only the times at which they change. We can incorporate knowledge of the state sequence into the EM algorithm simply by restricting the sums over paths in eqs. (10) and (12) to those that pass through the desired sequence.

4 AUTOMATIC SPEECH RECOGNITION


The Markov processes in this paper were conceived as models for automatic speech recognition (Rabiner & Juang, 1993). Speech recognizers take as input a sequence of feature vectors, each of which encodes a short window of speech. Acoustic feature vectors typically have ten or more components, so that a particular sequence of feature vectors can be viewed as tracing out a multidimensional curve. The goal of a speech recognizer is to translate this curve into a sequence of words, or more generally, a sequence of sub-syllabic units known as phonemes. Denoting the feature vectors by $x_t$ and the phonemes by $s_t$, we can view this problem as the discrete-time equivalent of the segmentation problem in MPCs.

Why consider MPCs as models of speech recognition? Hidden Markov models (HMMs), the current leading technology, are also based on probabilistic methods. These models manipulate joint distributions of the form:

$$\Pr[s, x] = \prod_t \Pr[s_t | s_{t-1}]\, \Pr[x_t | s_t]. \qquad (25)$$

Though HMMs have led to significant advances in speech recognition, they are handicapped by certain weaknesses. One of these is the poor manner in which they handle variations in speaking rate. Intuitively, we can represent these variations by nonlinear warpings of time. For example, consider the pair of trajectories $x_t$ and $y_t$, where $y_t$ is created by the doubling operation:

$$y_t = \begin{cases} x_{t/2} & \text{if } t \text{ even}, \\ y_{t-1} & \text{if } t \text{ odd}. \end{cases} \qquad (26)$$

Both trajectories trace out the same curve, but $y_t$ does so at half the rate of $x_t$. Hidden Markov models will not assign these trajectories the same likelihood, nor are they guaranteed to infer equivalent segmentations. This example shows that HMMs do not even approximately capture the invariances modeled by MPCs or other arc-length based descriptions of speech (Tishby, 1990).
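The effect of the doubling operation in eq. (26) is easy to check numerically. The sketch below is illustrative only: it uses zero-based indexing, a hypothetical random-walk trajectory, the Euclidean polyline length as a stand-in for arc length, and a unit-variance Gaussian as a stand-in for the per-frame emission term of eq. (25).

```python
import numpy as np

def double(x):
    # Doubling operation: every frame is repeated once (zero-based indexing).
    return np.repeat(x, 2, axis=0)

def polyline_length(x):
    # Discrete Euclidean arc length: repeated frames contribute zero length.
    return float(np.sum(np.linalg.norm(np.diff(x, axis=0), axis=1)))

def frame_log_likelihood(x, mean, var):
    # Sum of per-frame Gaussian log-densities with a shared scalar mean and variance.
    return float(np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2.0 * np.pi * var))))

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=(50, 3)), axis=0)   # hypothetical feature trajectory
y = double(x)                                     # same curve at half the rate

same_length = np.isclose(polyline_length(x), polyline_length(y))
doubled_score = np.isclose(frame_log_likelihood(y, 0.0, 1.0),
                           2.0 * frame_log_likelihood(x, 0.0, 1.0))
```

The arc length is unchanged by the warping, whereas the frame-wise log-likelihood doubles, which illustrates why a score that sums over frames is sensitive to speaking rate.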


Admittedly, the warping in eq. (26) represents a highly idealized picture of acoustic variability. Nevertheless, there is a great deal of empirical evidence that HMMs suffer from the inability to model variations in speaking rate (Siegler & Stern, 1995). For example, word error rates increase dramatically when one moves from scripted to spontaneous speech. Also, one generally observes that consonants are more frequently botched than vowels. The reason is that in HMMs, the contribution of particular states to the overall log-likelihood is in direct proportion to their duration. Thus training procedures designed to maximize the log-likelihood are inherently biased to model long-lived phonemes (i.e., vowels) more accurately than short-lived ones.

MPCs are quite different from HMMs in this respect. In MPCs, the contribution of each state to the log-likelihood is determined by its arc length. The weighting by arc length attaches a more important role to short-lived but non-stationary phonemes. Of course, one can imagine heuristics in HMMs that achieve the same effect, such as dividing each state's contribution to the log-likelihood by its observed (or inferred) duration. Unlike such heuristics, however, the state-dependent metric $g(x)$ in MPCs is learned from data; in particular, it is designed to reweight the speech signal in a way that reflects the actual statistics of acoustic trajectories.

So far we have emphasized the invariance to nonlinear warpings of time as the main difference between MPCs and HMMs. Another important difference, however, lies in what each tries to model. While MPCs attempt to model the conditional distribution $\Pr[s|x]$, HMMs attempt to model the joint distribution, $\Pr[s,x]$. Only the former is required for speech recognition, yet HMMs attempt something much more ambitious by learning a generative model of acoustic trajectories. Maximum likelihood training in HMMs is designed to increase the likelihood of observed trajectories, $\Pr[x]$. Unfortunately, because HMMs do not represent the true model of speech, maximizing this likelihood does not always translate into minimizing error rates. These issues point to yet another difference between MPCs and HMMs. Learning in MPCs is directed at learning a recognition model, $\Pr[s|x]$, as opposed to a synthesis model, $\Pr[x|s]$. The direction of conditioning is a crucial difference between maximum likelihood estimation in MPCs and HMMs.

In terms of previous work, our motivation for MPCs most closely resembles that of Tishby (1990), who stressed the importance of invariance to nonlinear warpings of time as a mathematical symmetry. In that MPCs stress the continuous nature of the speech signal, they also bear some resemblance to so-called segmental acoustic models (Ostendorf, Digalakis, & Kimball, 1996) of speech. Unlike HMMs, segmental acoustic models enforce the constraint that acoustic feature vectors within the same phonemic state trace out a continuous trajectory. Despite this shared emphasis on continuity, however, segmental models and MPCs differ in fundamental respects. In particular, segmental models incorporate the constraint of continuity by building a more complicated synthesis model $\Pr[x|s]$ of acoustic trajectories. They retain, however, the usual Markov assumption between states:

$$\Pr[s_t | s_{t-1}, s_{t-2}, \ldots] = \Pr[s_t | s_{t-1}]. \qquad (27)$$

By contrast, MPCs build a recognition model $\Pr[s|x]$ whose very definition is conditioned on the existence of a continuous trajectory. Moreover, the Markov assumption in MPCs, as embodied by eq. (6), is conditioned on the current position and tangent vector of the acoustic feature trajectory. This differs from the Markov assumption in eq. (27), which is made independent of (or unconditioned on) the acoustic features. Finally, to the best of our knowledge, MPCs are novel in two key respects: the formulation of a warp-invariant probabilistic model explicitly in terms of arc length, and the emphasis on learning a metric $g(x)$ for each hidden state of the Markov process. These ideas differentiate MPCs from segmental acoustic models as well as ordinary HMMs.

The starting point of this work was to postulate eq. (1) as an invariance of random processes. Of course, it would be naive to expect speech signals to exhibit a strict invariance to nonlinear warpings of time. The acoustic realization of a phoneme does depend to some extent on the speaking rate, and certain phonemes are more likely to be stretched or shortened than others. To accommodate this, one can relax the warping invariance in MPCs. This is most easily done by building models of the space-time¹ trajectories $X(t) = \{x(t), t\}$ and computing generalized arc lengths, $dL = [\dot{X}^T G(X)\, \dot{X}]^{1/2}\, dt$, where $\dot{X} = \{\dot{x}, 1\}$ and $G(X)$ is a space-time metric. The effect of replacing $\dot{x}$ by $\dot{X}$ is to allow each acoustic feature vector to contribute a finite amount to the overall log-likelihood even when $|\dot{x}|$ is zero, that is, even when it represents a perfectly stationary frame of speech.

We are currently evaluating MPCs as engines for automatic speech recognition. Naturally, we expect that many further elaborations will be required to surpass the finely tuned performance of modern recognizers. These may include more sophisticated parameterizations of the metric $g_i(x)$, the use of information from higher order derivatives (e.g., $\ddot{x}$ and $\dddot{x}$), and/or transition probabilities $a_{ij}(x)$ that vary along the length of the curve. Nevertheless, we hope that this paper serves to introduce the basic principles of MPCs, as well as to suggest an intriguing departure from traditional methods in automatic speech recognition.

¹The admixture of space and time coordinates in this way is an old idea from physics, originating in the theory of relativity (Einstein, 1924) (though in that context the metric is negative-definite).


A  REESTIMATION  FORMULAS 


In this appendix we derive the reestimation formulas, eqs. (19-21), and show that they lead to monotonic increases in the log-likelihood, eq. (16).

We begin by examining a simpler problem. Let $\{x(t) \mid t \in [0, \tau]\}$ denote a $D$-dimensional trajectory, and let $\Phi(x) \geq 0$ denote an everywhere non-negative function of $x$. Now consider the function:

$$\ell(\sigma) = \int_0^\tau dt \left[\dot{x}^T \sigma^{-1} \dot{x}\right]^{1/2} \Phi(x(t)), \qquad (28)$$

where $\sigma$ is a $D \times D$ positive-definite matrix. The right hand side of eq. (28) clearly depends on the trajectory $x(t)$ and the function $\Phi(x)$, but for now let us regard both of these as fixed and consider $\ell(\sigma)$ simply as a function of the matrix $\sigma$.


Since $\sigma$ is positive-definite and $\Phi(x) \geq 0$, we immediately observe that the function $\ell(\sigma)$ is bounded below by zero. Let us consider how to find the value of $\sigma$ that minimizes $\ell(\sigma)$, subject to the determinant constraint $|\sigma| = 1$. Note that the matrix elements of $\sigma^{-1}$ appear nonlinearly in the right hand side of eq. (28); thus it is not possible to compute their optimal values in closed form. As an alternative, we consider the auxiliary function:

$$Q(\rho, \sigma) = \int_0^\tau dt\, \frac{1}{2} \left\{ \frac{\dot{x}^T \rho^{-1} \dot{x}}{\left[\dot{x}^T \sigma^{-1} \dot{x}\right]^{1/2}} + \left[\dot{x}^T \sigma^{-1} \dot{x}\right]^{1/2} \right\} \Phi(x(t)), \qquad (29)$$

where $\rho$ is a $D \times D$ positive-definite matrix like $\sigma$. It follows directly from the definition in eq. (29) that $\ell(\sigma) = Q(\sigma, \sigma)$. Somewhat less trivially, we observe that $Q(\rho, \rho) \leq Q(\rho, \sigma)$ for all positive definite matrices $\rho$ and $\sigma$. This inequality follows from the concavity of the square root function, as illustrated in figure 2.


Figure 2: The square root function is concave and upper bounded by $\sqrt{z} \leq \frac{1}{2}\left[z/\sqrt{z_0} + \sqrt{z_0}\right]$ for all $z_0 > 0$. The bounding tangents are shown for two values of $z_0$, one of which is $z_0 = 1$.

Consider the value of $\rho$ which minimizes $Q(\rho, \sigma)$, subject to the determinant constraint $|\rho| = 1$. We denote this value by $\tilde{\sigma} = \arg\min_{|\rho|=1} Q(\rho, \sigma)$. Because the matrix elements of $\rho^{-1}$ appear linearly in $Q(\rho, \sigma)$, this minimization essentially reduces to computing the covariance matrix of the tangent vector $\dot{x}$, as distributed along the trajectory $x(t)$. In particular, we have:

$$\tilde{\sigma} \propto \int_0^\tau dt\, \frac{\dot{x}\, \dot{x}^T}{\left[\dot{x}^T \sigma^{-1} \dot{x}\right]^{1/2}}\, \Phi(x(t)), \qquad (30)$$

where the constant of proportionality is determined by the constraint $|\tilde{\sigma}| = 1$. To minimize $\ell(\sigma)$ with respect to $\sigma$, we now consider the iterative procedure where at each step we replace $\sigma$ by $\tilde{\sigma}$. We observe that:

$$\begin{aligned} \ell(\tilde{\sigma}) &= Q(\tilde{\sigma}, \tilde{\sigma}) \\ &\leq Q(\tilde{\sigma}, \sigma) \quad \text{due to concavity} \\ &\leq Q(\sigma, \sigma) \quad \text{since } \tilde{\sigma} = \arg\min_\rho Q(\rho, \sigma) \\ &= \ell(\sigma), \end{aligned}$$

with equality generally holding only when $\tilde{\sigma} = \sigma$. In other words, this iterative procedure converges monotonically to a local minimum of $\ell(\sigma)$.
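The monotone behavior of this iteration can be observed on a discretized trajectory. The following is a minimal sketch under stated assumptions: the trajectory, the weighting function `phi`, and the step count are hypothetical, integrals are replaced by sums over samples, and the update sets the new matrix proportional to the weighted covariance of the tangent vectors and rescales it to unit determinant.

```python
import numpy as np

def ell(sigma, xdot, phi, dt):
    """Discretized objective: sum over samples of sqrt(xdot^T sigma^{-1} xdot) * phi * dt."""
    q = np.einsum("td,de,te->t", xdot, np.linalg.inv(sigma), xdot)
    return float(np.sum(np.sqrt(q) * phi) * dt)

def update_sigma(sigma, xdot, phi, dt):
    """One reestimation step: weighted covariance of tangent vectors, unit determinant."""
    q = np.einsum("td,de,te->t", xdot, np.linalg.inv(sigma), xdot)
    m = np.einsum("t,td,te->de", phi / np.sqrt(q), xdot, xdot) * dt
    return m / np.linalg.det(m) ** (1.0 / m.shape[0])   # rescale so that det = 1

t = np.linspace(0.0, 2.0 * np.pi, 500)
dt = t[1] - t[0]
x = np.stack([np.cos(t), 0.3 * np.sin(2.0 * t)], axis=1)        # hypothetical trajectory
xdot = np.stack([-np.sin(t), 0.6 * np.cos(2.0 * t)], axis=1)    # its exact tangent vectors
phi = 1.0 + x[:, 0] ** 2                                        # hypothetical non-negative weight

sigma = np.eye(2)                                               # starts with unit determinant
history = [ell(sigma, xdot, phi, dt)]
for _ in range(20):
    sigma = update_sigma(sigma, xdot, phi, dt)
    history.append(ell(sigma, xdot, phi, dt))
```

The list `history` records the objective after each step; it should never increase, and the determinant constraint holds after every update.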

Let us now relate the problem of minimizing $\ell(\sigma)$ to the original problem of maximizing the likelihood in eq. (16). There we saw that for each state of the MPC, it was necessary to optimize the parameters $\{\mu, \Sigma, \sigma\}$. Here, for notational convenience, we have dropped the subscript denoting the state index of these parameters. Note that in terms of these parameters, maximizing each state's contribution to the log-likelihood is equivalent to minimizing the total arc length of its segments in the training set. This problem can be viewed as a particular instance of the one considered above, provided that we make the identification:

$$\Phi(x) = (x - \mu)^T \Sigma^{-1} (x - \mu). \qquad (31)$$

Of course, now in addition to minimizing the arc length with respect to $\sigma$, we must also optimize the values of $\mu$ and $\Sigma$. To this end, note that eq. (31) defines a standard quadratic form; hence for fixed $\sigma$, the values of $\mu$ and $\Sigma$ that minimize eq. (28) are given simply by the mean and covariance matrix of the points $x(t)$ along each state's segments, as weighted by the measure $[\dot{x}^T \sigma^{-1} \dot{x}]^{1/2}$. Within each state, we thus obtain a monotonically convergent learning procedure by alternately optimizing $\mu$ and $\Sigma$ for fixed $\sigma$, then optimizing $\sigma$ for fixed $\mu$ and $\Sigma$. This leads directly to the reestimation formulas in eqs. (19-21).

Abstract


In  this  paper  we  study  a  dual  version  of  the 
Ridge  Regression  procedure.  It  allows  us  to 
perform  non-linear  regression  by  construct¬ 
ing  a  linear  regression  function  in  a  high  di¬ 
mensional  feature  space.  The  feature  space 
representation  can  result  in  a  large  increase 
in  the  number  of  parameters  used  by  the  al¬ 
gorithm.  In  order  to  combat  this  “curse  of 
dimensionality” ,  the  algorithm  allows  the  use 
of  kernel  functions,  as  used  in  Support  Vector 
methods.  We  also  discuss  a  powerful  family 
of  kernel  functions  which  is  constructed  using 
the  ANOVA  decomposition  method  from  the 
kernel  corresponding  to  splines  with  an  infi¬ 
nite  number  of  nodes.  This  paper  introduces 
a  regression  estimation  algorithm  which  is 
a  combination  of  these  two  elements:  the 
dual  version  of  Ridge  Regression  is  applied 
to  the  ANOVA  enhancement  of  the  infinite- 
node  splines.  Experimental  results  are  then 
presented  (based  on  the  Boston  Housing  data 
set)  which  indicate  the  performance  of  this 
algorithm  relative  to  other  algorithms. 


1  INTRODUCTION 

First of all, let us formulate the regression estimation problem. Suppose we have a set of vectors¹ $x_1, \ldots, x_T$, and we also have a supervisor which gives us a real value $y_t$ for each of the given vectors. Our problem is to construct a learning machine which, when given a new set of examples, minimises some measure of discrepancy between its prediction $\hat{y}_t$ and the value of $y_t$. The measure of loss which we are using, average square loss ($L$), is defined by

$$L = \frac{1}{l} \sum_{t=1}^{l} (y_t - \hat{y}_t)^2,$$

where $y_t$ are the supervisor's answers, $\hat{y}_t$ are the predicted values, and $l$ is the number of vectors in the test set.

¹We will use subscripts to indicate a particular vector (e.g. $x_t$ is the $t$th vector), and superscripts to indicate a particular vector element (e.g. $x^i$ is the $i$th element of the vector $x$).


Least  Squares  and  Ridge  Regression  are  classical  sta¬ 
tistical  algorithms  which  have  been  known  for  a  long 
time.  They  have  been  widely  used,  and  recently  some 
papers  such  as  Drucker  et  al.  [2]  have  used  regres¬ 
sion  in  conjunction  with  a  high  dimensional  feature 
space.  That  is  the  original  input  vectors  are  mapped 
into  some  feature  space,  and  the  algorithms  are  then 
used  to  construct  a  linear  regression  function  in  the 
feature  space,  which  represents  a  non-linear  regression 
in  the  original  input  space.  There  is,  however,  a  prob¬ 
lem  encountered  when  using  these  algorithms  within 
a  feature  space.  Very  often  we  have  to  deal  with  a 
very  large  number  of  parameters,  and  this  leads  to  se¬ 
rious  computational  difficulties  that  can  be  impossible 
to  overcome.  In  order  to  combat  this  “curse  of  dimen¬ 
sionality”  problem,  we  describe  a  dual  version  of  the 
Least  Squares  and  Ridge  Regression  algorithms,  which 
allows  the  use  of  kernel  functions.  This  approach  is 
closely  related  to  Vapnik’s  kernel  method  as  used  in 
the  Support  Vector  Machine.  Kernel  functions  repre¬ 
sent  dot  products  in  a  feature  space,  which  allows  the 
algorithms  to  be  used  in  a  feature  space  without  having 
to  carry  out  computations  within  that  space.  Kernel 
functions  themselves  can  take  many  forms  and  partic¬ 
ular  attention  is  paid  to  a  family  of  kernel  functions 
which  are  constructed  using  ANOVA  decomposition 
(Vapnik [10]; see also Wahba [11, 12]). There are two major objectives of this paper:

1.  To  show  how  to  use  kernel  functions  to  overcome 
the  curse  of  dimensionality  in  the  above  men¬ 
tioned  algorithms. 

2.  To  demonstrate  how  ANOVA  decomposition  ker¬ 
nels  can  be  constructed,  and  evaluate  their  perfor¬ 
mance  compared  to  polynomial  and  spline  kernels, 
on  a  real  world  data  set. 

Results  from  experiments  performed  on  the  well  known 
Boston  housing  data  set  are  then  used  to  show  that  the 
Least  Squares  and  Ridge  Regression  algorithms  per¬ 
form  well  in  comparison  with  some  other  algorithms. 
The  results  also  show  that  the  ANOVA  kernels,  which 
only  consider  a  subset  of  the  input  parameters,  can  im¬ 
prove  on  results  obtained  on  the  same  kernel  function 
without  the  ANOVA  technique  applied.  In  the  next 
section  we  present  the  dual  form  of  Least  Squares  and 
Ridge  Regression. 

2  RIDGE  REGRESSION  IN  DUAL 
VARIABLES 

Before  presenting  the  algorithms  in  dual  variables,  the 
original  formulation  of  Least  Squares  and  Ridge  Re¬ 
gression  is  stated  here  for  clarity. 

Suppose we have a training set $(x_1, y_1), \ldots, (x_T, y_T)$, where $T$ is the number of examples, $x_t$ are vectors in $\mathbb{R}^n$ ($n$ is the number of attributes) and $y_t \in \mathbb{R}$, $t = 1, \ldots, T$. Our comparison class consists of the linear functions $y = w \cdot x$, where $w \in \mathbb{R}^n$.

The  Least  Squares  method  recommends  computing 
w  =  Wo  which  minimizes 

T 

Lt{w)  =  '^{yt  -  w  ■  xt)"^ 

t=i 

and  using  wo  for  labeling  future  examples:  if  a  new 
example  has  attributes  x,  the  predicted  label  is  wo  -x. 

The Ridge Regression procedure is a slight modification
of the least squares method and replaces the objective
function $L_T(w)$ by

$$a \|w\|^2 + \sum_{t=1}^{T} (y_t - w \cdot x_t)^2,$$

where $a$ is a fixed positive constant.

We now derive a “dual version” for Ridge Regression
(RR); since we allow $a = 0$, this includes Least Squares
(LS) as a special case. In this derivation we partially
follow Vapnik [8]. We start with re-expressing our
problem as: minimize the expression

$$a \|w\|^2 + \sum_{t=1}^{T} \xi_t^2 \qquad (1)$$

under the constraints

$$y_t - w \cdot x_t = \xi_t, \quad t = 1, \ldots, T. \qquad (2)$$

Introducing Lagrange multipliers $\alpha_t$, $t = 1, \ldots, T$, we
can replace our constrained optimization problem by
the problem of finding the saddle point of the function

$$a \|w\|^2 + \sum_{t=1}^{T} \xi_t^2 + \sum_{t=1}^{T} \alpha_t (y_t - w \cdot x_t - \xi_t). \qquad (3)$$

In accordance with the Kuhn-Tucker theorem, there
exist values of Lagrange multipliers $\alpha = \alpha^{KT}$ for which
the minimum of (3) equals the minimum of (1), under
constraints (2). To find the optimal $w$ and $\xi$ we will do
the following: first, minimize (3) in $w$ and $\xi$ and then
maximize it in $\alpha$. Notice that for any fixed values of
$\alpha$ the minimum of (3) (in $w$ and $\xi$) is less than or
equal to the value of the optimization problem (1)-(2),
and equality is attained when $\alpha = \alpha^{KT}$. By doing
this, we will therefore find the solution to our original
constrained minimization problem (1)-(2).

Differentiating (3) in $w$, we obtain the condition

$$2aw - \sum_{t=1}^{T} \alpha_t x_t = 0,$$

i.e.,

$$w = \frac{1}{2a} \sum_{t=1}^{T} \alpha_t x_t. \qquad (4)$$

(Lagrange multipliers are usually interpreted as reflecting
the importance of the corresponding constraints,
and equation (4) shows that $w$ is proportional
to the linear combination of $x_t$, each of which is taken
with a weight proportional to its importance.) Substituting
this into (3), we obtain

$$\frac{1}{4a} \sum_{s,t=1}^{T} \alpha_s \alpha_t (x_s \cdot x_t) + \sum_{t=1}^{T} \xi_t^2 - \frac{1}{2a} \Bigl( \sum_{t=1}^{T} \alpha_t x_t \Bigr) \cdot \Bigl( \sum_{t=1}^{T} \alpha_t x_t \Bigr) + \sum_{t=1}^{T} y_t \alpha_t - \sum_{t=1}^{T} \alpha_t \xi_t$$

$$= -\frac{1}{4a} \sum_{s,t=1}^{T} \alpha_s \alpha_t (x_s \cdot x_t) + \sum_{t=1}^{T} \xi_t^2 + \sum_{t=1}^{T} y_t \alpha_t - \sum_{t=1}^{T} \alpha_t \xi_t. \qquad (5)$$

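Differentiating (5) in $\xi_t$, we obtain

$$\xi_t = \frac{\alpha_t}{2}, \quad t = 1, \ldots, T \qquad (6)$$

(i.e., the importance of the $t$th constraint is proportional
to the corresponding residual); substitution
into (5) gives

$$-\frac{1}{4a} \sum_{s,t=1}^{T} \alpha_s \alpha_t (x_s \cdot x_t) - \frac{1}{4} \sum_{t=1}^{T} \alpha_t^2 + \sum_{t=1}^{T} y_t \alpha_t.$$

Denoting $K$ as the $T \times T$ matrix of dot products,

$$K_{s,t} = x_s \cdot x_t,$$

and differentiating in $\alpha_t$, we obtain the condition

$$-\frac{1}{2a} K \alpha - \frac{1}{2} \alpha + y = 0$$

(here $\alpha = (\alpha_1, \ldots, \alpha_T)'$ and $y = (y_1, \ldots, y_T)'$ are column vectors),
which is equivalent to

$$\alpha = 2a (K + aI)^{-1} y.$$

Recalling (4), we obtain that the prediction $y$ given by
the Ridge Regression procedure on the new unlabeled
example $x$ is

$$w \cdot x = \Bigl( \frac{1}{2a} \sum_{t=1}^{T} \alpha_t x_t \Bigr) \cdot x = \frac{1}{2a} \alpha \cdot k = y'(K + aI)^{-1} k,$$

where $k = (k_1, \ldots, k_T)'$ is the vector of the dot products:

$$k_t := x_t \cdot x, \quad t = 1, \ldots, T.$$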

Lemma 1 RR's prediction of the label $y$ of a new unlabeled
example $x$ is

$$y'(K + aI)^{-1} k, \qquad (8)$$

where $K$ is the matrix of dot products of the vectors
$x_1, \ldots, x_T$ in the training set,

$$K_{s,t} = \mathcal{K}(x_s, x_t), \quad s = 1, \ldots, T, \quad t = 1, \ldots, T,$$

$k$ is the vector of dot products of $x$ and the vectors in
the training set,

$$k_t := \mathcal{K}(x_t, x), \quad t = 1, \ldots, T,$$

and $\mathcal{K}(x, x') = x \cdot x'$ is simply a function which returns
the dot product of the two vectors, $x$ and $x'$.
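
As a rough illustration of Lemma 1, the following Python sketch evaluates formula (8) directly. It is not the authors' implementation: the function names (`dot`, `solve`, `dual_ridge_predict`) and the small Gaussian-elimination solver are assumptions made here so that the example needs only the standard library. It also assumes that K + aI is invertible, which holds for any a > 0.

```python
def dot(u, v):
    """Linear kernel: the ordinary dot product of two vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def dual_ridge_predict(xs, ys, x_new, a, kernel=dot):
    """Return y'(K + aI)^{-1} k, i.e. formula (8), for one new example."""
    T = len(xs)
    # K + aI, where K[s][t] = kernel(x_s, x_t).
    K_reg = [[kernel(xs[s], xs[t]) + (a if s == t else 0.0) for t in range(T)]
             for s in range(T)]
    k = [kernel(x_t, x_new) for x_t in xs]
    # K + aI is symmetric, so y'(K + aI)^{-1} k = ((K + aI)^{-1} y) . k
    c = solve(K_reg, ys)
    return sum(ci * ki for ci, ki in zip(c, k))


# One attribute, three examples lying on the line y = 2x.
prediction = dual_ridge_predict([[1.0], [2.0], [3.0]], [2.0, 4.0, 6.0], [4.0], a=1.0)
print(prediction)
```

With the linear kernel this agrees with the primal ridge solution w = (X'X + aI)^{-1} X'y; passing a different `kernel` argument gives the feature-space regression discussed in the next section.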


3  LINEAR  REGRESSION  IN 
FEATURE  SPACE 

When $\mathcal{K}(x_i, x_j)$ is simply a function which returns the
dot product of the given vectors, formula (8) corresponds
to performing linear regression within the input
space $\mathbb{R}^n$ defined by the examples. If we want to construct
a linear regression in some feature space, we first
have to choose a mapping from the original space $X$
to a higher dimensional feature space $F$ ($\phi : X \to F$).
In order to use Lemma 1 to construct the regression in
the feature space, the function $\mathcal{K}$ must now correspond
to the dot product $\phi(x_i) \cdot \phi(x_j)$. It is not necessary to
know $\phi(x)$ as long as we know $\mathcal{K}(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$.
The question of which functions $\mathcal{K}$ correspond to a dot
product in some feature space $F$ is answered by Mercer's
theorem and addressed by Vapnik [9] in his discussion
of support vector methods. As an illustration
of the idea, an example of a simple kernel function
is presented here. (See Girosi [4].) Suppose there is
a mapping function $\phi$ which maps a two-dimensional
vector into 6 dimensions:

$$\phi : (x^1, x^2) \mapsto ((x^1)^2, (x^2)^2, \sqrt{2} x^1, \sqrt{2} x^2, \sqrt{2} x^1 x^2, 1),$$

then dot products in $F$ take the form

$$\phi(x) \cdot \phi(y) = (x^1)^2 (y^1)^2 + (x^2)^2 (y^2)^2 + 2 x^1 y^1 + 2 x^2 y^2 + 2 x^1 y^1 x^2 y^2 + 1 = ((x \cdot y) + 1)^2.$$

One possible kernel function is therefore $((x \cdot y) + 1)^2$.
This can be generalised into a kernel function of the
form

$$\mathcal{K}(x, y) = ((x \cdot y) + 1)^d,$$

for any degree $d$ and for inputs of more than 2 dimensions.
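
The identity above can be checked numerically. The short Python sketch below compares the dot product computed explicitly in the 6-dimensional feature space with the value of the kernel computed in the 2-dimensional input space; the function names `phi` and `poly_kernel` and the sample points are illustrative choices, not part of the paper.

```python
import math


def phi(x):
    """Explicit 6-dimensional feature map for a 2-dimensional input."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [x1 * x1, x2 * x2, r2 * x1, r2 * x2, r2 * x1 * x2, 1.0]


def poly_kernel(x, y, d=2):
    """Polynomial kernel ((x . y) + 1)^d, computed in the input space."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1.0) ** d


x, y = (0.5, -1.5), (2.0, 3.0)
explicit = sum(p * q for p, q in zip(phi(x), phi(y)))  # dot product in F
implicit = poly_kernel(x, y)                           # same value, no mapping
print(explicit, implicit)
```

The two numbers agree up to floating-point rounding, while the kernel evaluation never constructs the feature vectors.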

The  use  of  kernel  functions  allows  us  to  construct  a 
linear  regression  function  in  a  high  dimensional  feature 
space  (which  corresponds  to  a  non-linear  regression  in 
the  input  space)  avoiding  the  curse  of  having  to  carry 
out  computations  in  the  high  dimensional  space.  In 
particular,  kernel  functions  are  a  way  to  combat  the 
curse  of  dimensionality  problems  such  as  those  faced  in 
Drucker  et  al.  [2],  where  a  regression  function  was  also 
constructed  in  a  feature  space,  but  computations  were 
carried  out  in  the  high  dimensional  space,  leading  to
a  huge  number  of  parameters  for  non-trivial  problems.

For  more  information  on  the  kernel  technique,  see  Vap¬ 
nik  [8,  10,  9]  and  Wahba  [11]. 

4  MULTIPLICATIVE  KERNELS

Before indicating how ANOVA decomposition can be
used to form kernels, a brief description is needed of
the family of kernels to which the ANOVA decomposition
can be applied, this being the family of multiplicative
kernels. This refers to the set of kernels where
the multi-dimensional case is calculated as the product
of the one-dimensional case. That is, if the one-dimensional
case is $k(x^i, y^i)$, then the $n$-dimensional
case is

$$\mathcal{K}_n(x, y) = \prod_{i=1}^{n} k(x^i, y^i).$$

One such kernel (to which the ANOVA decomposition
is applied here) is the spline kernel with an infinite
number of nodes (see Vapnik [8, 10] and Kimeldorf
and Wahba [5]). A spline approximation which has an
infinite number of nodes can be defined on the interval
$(0, a)$, $0 < a < \infty$, as the expansion

$$f(x) = \int_0^a a(t) (x - t)_+^d \, dt + \sum_{i=0}^{d} a_i x^i,$$

where $a_i$, $i = 0, \ldots, d$, are unknown values, and $a(t)$
is an unknown function which defines the expansion.
This can be considered as an inner product, and the
kernel which generates splines of dimension $d$ with an
infinite number of nodes can be expressed as

$$k_d(x, y) = \int_0^a (x - t)_+^d (y - t)_+^d \, dt + \sum_{r=0}^{d} x^r y^r.$$

Note that when $t > \min(x, y)$ the function under the
integral sign will have value zero. It is therefore sufficient
only to consider the interval $(0, \min(x, y))$, which
makes the formula above equivalent to

$$k_d(x, y) = \sum_{r=0}^{d} \frac{\binom{d}{r}}{2d - r + 1} \min(x, y)^{2d - r + 1} |x - y|^r + \sum_{r=0}^{d} x^r y^r.$$

In particular, for the case of linear splines ($d = 1$) we
have:

$$k_1(x, y) = 1 + xy + \frac{1}{2} |y - x| \min(x, y)^2 + \frac{\min(x, y)^3}{3}.$$
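
The closed form of the spline kernel translates into a few lines of code. The sketch below is only an illustration under stated assumptions: the names `spline_kernel_1d`, `multiplicative_kernel` and `linear_spline_closed_form` are invented here, and the inputs are assumed to be non-negative, in line with the interval $(0, a)$ used above.

```python
from math import comb


def spline_kernel_1d(x, y, d=1):
    """One-dimensional spline kernel k_d with an infinite number of nodes."""
    m = min(x, y)
    # Closed form of the integral of (x - t)^d (y - t)^d over (0, min(x, y)).
    integral = sum(
        comb(d, r) / (2 * d - r + 1) * m ** (2 * d - r + 1) * abs(x - y) ** r
        for r in range(d + 1)
    )
    # Polynomial part: sum over r of x^r y^r.
    polynomial = sum((x * y) ** r for r in range(d + 1))
    return integral + polynomial


def multiplicative_kernel(x, y, k1d=spline_kernel_1d):
    """n-dimensional kernel formed as the product of one-dimensional kernels."""
    result = 1.0
    for xi, yi in zip(x, y):
        result *= k1d(xi, yi)
    return result


def linear_spline_closed_form(x, y):
    """The d = 1 special case, written out term by term."""
    m = min(x, y)
    return 1 + x * y + 0.5 * abs(y - x) * m ** 2 + m ** 3 / 3


print(spline_kernel_1d(0.3, 0.7), linear_spline_closed_form(0.3, 0.7))
print(multiplicative_kernel([0.3, 0.5], [0.7, 0.2]))
```

The first line prints the same value twice (up to rounding), confirming that the general formula reduces to the linear spline expression when d = 1.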


5  ANOVA  DECOMPOSITION 
KERNELS 

The  ANOVA  decomposition  kernels  are  inspired  by 
their  namesake  in  statistics,  which  analyses  different 
subsets  of  variables.  The  actual  decomposition  can  be 
adapted  to  form  kernels  (as  in,  e.g.,  Vapnik  [10])  which 
involve  different  subsets  of  the  attributes  of  the  exam¬ 
ples  up  to  a  certain  size.  There  are  two  main  reasons 
for  choosing  to  use  ANOVA  decomposition.  Firstly, 
the  different  subsets  which  are  considered  may  group 
together  like  variables,  which  can  lead  to  greater  pre¬ 
dictive  power.  Also,  by  only  considering  some  subsets 
of  the  input  parameters,  ANOVA  decomposition  re¬ 
duces  the  VC  dimension  of  the  set  of  functions  that 
you  are  considering,  which  can  avoid  overfitting  your 
training  data. 

Given a one-dimensional kernel $k$, the ANOVA kernels
are defined as follows:

$$\mathcal{K}_1(x, y) = \sum_{1 \le k \le n} k(x^k, y^k),$$

$$\mathcal{K}_2(x, y) = \sum_{1 \le k_1 < k_2 \le n} k(x^{k_1}, y^{k_1}) k(x^{k_2}, y^{k_2}),$$

$$\vdots$$

$$\mathcal{K}_n(x, y) = \sum_{1 \le k_1 < \cdots < k_n \le n} k(x^{k_1}, y^{k_1}) \cdots k(x^{k_n}, y^{k_n}).$$

From Vapnik [10] the following recurrent procedure
can be used when calculating the value of $\mathcal{K}_n(x, y)$.
Let

$$\mathcal{K}^s(x, y) = \sum_{i=1}^{n} (k(x^i, y^i))^s,$$

and $\mathcal{K}_0(x, y) = 1$; then

$$\mathcal{K}_p(x, y) = \sum_{1 \le k_1 < k_2 < \cdots < k_p \le n} k(x^{k_1}, y^{k_1}) \cdots k(x^{k_p}, y^{k_p}),$$

$$\mathcal{K}_p(x, y) = \frac{1}{p} \sum_{s=1}^{p} (-1)^{s+1} \mathcal{K}_{p-s}(x, y) \mathcal{K}^s(x, y).$$

For the purposes of this paper, when using kernels produced
by ANOVA decomposition, only the order $p$ is
considered:

$$\mathcal{K}(x, y) = \mathcal{K}_p(x, y).$$

An alternative method of using ANOVA decomposition
would be to consider order $p$ and all lower orders
(as in Stitson [7]), i.e.,

$$\mathcal{K}(x, y) = \sum_{i=1}^{p} \mathcal{K}_i(x, y).$$
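
The recurrence can be sketched in Python and compared against the definition of the order-p ANOVA kernel as a sum over subsets of attributes. This is a minimal sketch, not the code used for the experiments: the function names and the sample one-dimensional kernel `k` are assumptions made for illustration.

```python
from itertools import combinations
from math import prod


def anova_kernel(x, y, p, k):
    """Order-p ANOVA kernel computed with the recurrence on power sums."""
    values = [k(xi, yi) for xi, yi in zip(x, y)]
    # power[s] = K^s(x, y) = sum_i k(x^i, y^i)^s for s = 1..p (index 0 unused).
    power = [None] + [sum(v ** s for v in values) for s in range(1, p + 1)]
    K = [1.0]  # K_0(x, y) = 1
    for q in range(1, p + 1):
        total = sum((-1) ** (s + 1) * K[q - s] * power[s] for s in range(1, q + 1))
        K.append(total / q)
    return K[p]


def anova_kernel_brute_force(x, y, p, k):
    """Order-p ANOVA kernel straight from the definition (sum over subsets)."""
    values = [k(xi, yi) for xi, yi in zip(x, y)]
    return sum(prod(subset) for subset in combinations(values, p))


def k(u, v):
    """Hypothetical one-dimensional kernel used only for this check."""
    return 1.0 + u * v


x = [0.1, 0.5, 0.9, 0.3]
y = [0.7, 0.2, 0.4, 0.8]
for p in range(1, 5):
    print(p, anova_kernel(x, y, p, k), anova_kernel_brute_force(x, y, p, k))
```

The recurrence needs on the order of n·p + p² operations, whereas the direct sum ranges over all subsets of size p, which is why the recurrent procedure is preferred for larger n.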

6  EXPERIMENTAL  RESULTS

Experiments  were  conducted  on  the  Boston  Housing 
data  set^.  This  is  a  well  known  data  set  for  testing 
non-linear  regression  methods;  see,  e.g.,  Breiman  [1] 
and  Saunders  [6].  The  data  set  consists  of  506  cases 
in  which  12  continuous  variables  and  1  binary  vari¬ 
able  determine  the  median  house  price  in  a  certain 
area  of  Boston  in  thousands  of  dollars.  The  continuous 
variables  represent  various  values  pertaining  to  differ¬ 
ent  locational,  economic  and  structural  features  of  the 
house.  The  prices  lie  between  $5000  and  $50,000  in 
units  of  $1000.  Following  the  method  used  by  Drucker 
et  al.  [2],  the  data  set  was  partitioned  into  a  train¬ 
ing  set  of  401  cases,  a  validation  set  of  80  cases  and 
a  test  set  of  25  cases.  This  partitioning  was  carried 
out  randomly  100  times,  in  order  to  carry  out  100  tri¬ 
als  on  the  data.  For  each  trial  the  Ridge  Regression 
algorithm  was  applied  using: 

•  a  kernel  which  corresponds  to  a  spline  approxima¬ 
tion  with  an  infinite  number  of  nodes, 

•  the  same  kernel  but  with  the  ANOVA  decomposition
technique  applied,

•  and  polynomial  kernels. 

For  each  kernel  the  set  of  parameters  (the  order  of 
spline/degree  of  polynomial  and  the  value  of  coeffi¬ 
cient  a)  was  selected  which  gave  the  smallest  error  on 
the  validation  set,  and  then  the  error  on  the  test  set 
was  measured.  This  experiment  was  then  repeated  us¬ 
ing  a  support  vector  machine  (SVM),  with  the  same 
kernels  and  exactly  the  same  100  training  files  (see 
Stitson  [7]  for  full  details).  As  an  illustration  of  the 
number  of  parameters  which  were  considered  by  the 
Ridge  Regression  Algorithm  (and  the  SVM),  consider 
the  polynomial  kernel  which  was  outlined  earlier,  us¬ 
ing  a  degree  of  5.  This  maps  the  input  vectors  into  a 
high  dimensional  feature  space  which  is  equivalent  to 
evaluating  $13^5 = 371,293$  different  parameters.

^Available by anonymous FTP from:
ftp://ftp.ics.uci.com/pub/machine-learning-databases/housing.

The results obtained from the experiments are shown
in Table 1. The measure of error used for the tests
was the average squared error. For each of the 100
test files, the algorithm was run and the square of the
difference between the predicted and actual value was
taken. This was then averaged over the 25 test cases.
This produces an average error for each of the 100 test
files, and the average of these was taken, which produces
the final error quoted in the 3rd column
of the table. The variance measure in the table is the
average squared difference between the squared error
measured on each sample and the average squared error.
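
The error measure described above can be written down in a few lines. The sketch below is one possible reading, not the authors' evaluation script: it assumes that each "sample" in the variance measure is one of the 100 trials, and the function names are invented for illustration.

```python
def trial_error(predicted, actual):
    """Average squared error over the test cases of a single trial."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def summarise(trial_errors):
    """Mean of the per-trial errors and their average squared deviation.

    Assumption: the 'variance' column is taken across trials.
    """
    mean = sum(trial_errors) / len(trial_errors)
    variance = sum((e - mean) ** 2 for e in trial_errors) / len(trial_errors)
    return mean, variance


# Two made-up trials with three test cases each (the paper uses 100 trials of 25).
errors = [
    trial_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
    trial_error([2.0, 2.0, 2.0], [1.0, 2.0, 3.0]),
]
print(summarise(errors))
```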

There  are  two  additional  results  which  should  be  noted 
here.  One  is  from  Breiman  [1]  using  bagging  with  av¬ 
erage  squared  error  of  11.7,  and  one  from  Drucker  et 
al.  [2]  using  Support  Vector  regression  with  polynomial 
kernels  with  average  squared  error  of  7.2.  The  result 
obtained  by  Drucker  et  al.  is  slightly  better  than  the 
one  obtained  here  using  a  similar  machine;  this  may 
be,  however,  due  to  the  random  selection  of  the  train¬ 
ing,  validation  and  testing  sets. 

7  COMPARISONS 

In  this  section  we  will  give  a  comparison  of  the  results 
of  this  paper  with  the  known  results. 

7.1  SV  MACHINES 

In  this  subsection  we  describe  in  more  detail  the  con¬ 
nection  of  the  approach  of  this  paper  with  the  Support 
Vector  Machine. 

Our optimization problem (minimizing (1) under constraints
(2)) is essentially a special case of the following
general optimization problem: minimize the expression

$$\frac{1}{2} \|w\|^2 + C \Bigl( \sum_{t=1}^{T} (\xi_t^*)^k + \sum_{t=1}^{T} (\xi_t)^k \Bigr) \qquad (9)$$

under the constraints

$$y_t - w \cdot x_t \le \epsilon + \xi_t^*, \quad t = 1, \ldots, T, \qquad (10)$$

$$w \cdot x_t - y_t \le \epsilon + \xi_t, \quad t = 1, \ldots, T; \qquad (11)$$

$\epsilon \ge 0$ and $k \in \{1, 2\}$ are some constants. This optimization
problem (along with a similar problem corresponding
to Huber's loss function) is considered in
Vapnik [10], Chapter 11 (Vapnik, however, considers
more general regression functions of the form $w \cdot x + b$
rather than $w \cdot x$; the difference is minor because we
can always add an extra attribute which is always 1 to
all examples).

Our problem (1)-(2) corresponds to the problem (9)-(11)
with $k = 2$, $\epsilon = 0$ and $C = 1/a$. Vapnik [10] gives
a dual statement of his, and a fortiori our, problem; he
does not reach, however, the closed-form expression (8)
(because he was mainly interested in positive values of
$\epsilon$).

Table 1: Experimental Results on the Boston Housing Data

METHOD            KERNEL         SQUARED ERROR  VARIANCE
Ridge Regression  Polynomial     10.44          18.34
Ridge Regression  Splines         8.51          11.19
Ridge Regression  ANOVA Splines   7.69           8.27
SVM [7]           Polynomial      8.14          15.13
SVM               Splines         7.87          12.67
SVM               ANOVA Splines   7.72           9.44

As  we  mentioned  before,  our  derivation  of  formula  (8) 
follows  [8].  The  dual  Ridge  Regression  is  also  known  in 
traditional  statistics,  but  statisticians  usually  use  some 
clever  matrix  manipulations  rather  than  the  Lagrange 
method.  Our  derivation  (modelled  on  Vapnik’s)  gives 
some  extra  insight:  see,  e.g.,  equations  (4)  and  (6).  For 
an  excellent  survey  of  connections  between  Support 
Vector  Machine  and  the  work  done  in  statistics  we 
refer  the  reader  to  Wahba  [11,  12]  and  Girosi  [4]. 

7.2  KRIEGING 

Formula  (8)  is  well  known  in  the  theory  of  Krieging; 
in  this  subsection  we  will  explain  the  connection  for 
readers  who  are  familiar  with  Krieging.  Consider  the 
Bayesian  setting  where: 

•  the vector $w$ of weights is distributed according to
the normal distribution with mean 0 and covariance
matrix $\frac{1}{2a} I$;

•  $y_t = w \cdot x_t + \epsilon_t$, $t = 1, \ldots, T$, where $\epsilon_t$ are random
variables distributed normally with mean 0 and
variance $\frac{1}{2}$.

Then the optimization problem (1) under the constraints
(2) becomes the problem of finding the posterior
mode (which, because of our normality assumption,
coincides with the posterior mean) of $w$; therefore,
formula (8) gives the mean value of the random
variable $w \cdot x$ (which is the “clean version” of the label
$y = w \cdot x + \epsilon$ of the next example). Notice that the
random variables $y_1, \ldots, y_T, w \cdot x$ are jointly normal
and the covariances between them are

$$\mathrm{cov}(y_s, y_t) = \mathrm{cov}(w \cdot x_s + \epsilon_s, w \cdot x_t + \epsilon_t) = \frac{1}{2a} (x_s \cdot x_t) + \frac{1}{2} \delta_{s,t}$$

(where $\delta_{s,t} = 1$ if $s = t$ and $0$ otherwise) and

$$\mathrm{cov}(y_t, w \cdot x) = \mathrm{cov}(w \cdot x_t + \epsilon_t, w \cdot x) = \frac{1}{2a} (x_t \cdot x).$$

In accordance with the Krieging formula the best prediction
for $w \cdot x$ will be

$$\frac{1}{2a} k' \Bigl( \frac{1}{2a} K + \frac{1}{2} I \Bigr)^{-1} y = y'(K + aI)^{-1} k,$$

which coincides with (8).

8  CONCLUSIONS 

A  formula  for  Ridge  Regression  (which  included  Least 
Squares  as  a  special  case)  in  dual  variables  was  de¬ 
rived  using  the  method  of  Lagrange  multipliers.  This 
was  then  used  to  perform  linear  regression  in  a  feature 
space.  Therefore,  we  once  more  showed  how  the  prob¬ 
lem  of  learning  in  a  very  high  dimensional  space  can 
be  solved  by  using  kernel  functions.  This  allowed  the 
algorithm  to  overcome  the  “curse  of  dimensionality” 
and  run  efficiently,  even  though  a  very  large  number 
of  parameters  were  being  considered.  Experimental  re¬ 
sults  show  that  Ridge  Regression  performs  well.  The 
results  also  indicate  that  applying  ANOVA  decompo¬ 
sition  to  a  kernel  can  achieve  better  results  than  using 
the  same  kernel  without  the  technique  applied.  Both 
Ridge  Regression  and  the  Support  Vector  method  gave 
a  smaller  error  when  using  ANOVA  splines  compared 
to  the  other  spline  kernel. 

A  weak  part  of  our  experimental  section  is  that, 
though  the  Boston  housing  data  is  a  useful  benchmark, 
we  have  not  applied  our  algorithm  to  a  wider  range  of 
practical  problems.  This  is  what  we  plan  to  do  next. 

In  order  to  confirm  that  ANOVA  kernels  can  outperform
kernels  in  their  original  form,  the  ANOVA  decomposition
technique  should  be  applied  to  other  multiplicative
kernels.  The  technique  of  applying  kernel
functions  to  overcome  problems  of  high  dimensionality
should  also  be  investigated  further,  to  see  if  it  can
be  applied  to  any  other  algorithms  which  prove  computationally
difficult  or  impossible  when  faced  with  a
large  number  of  parameters.

We  feel  that  a  very  interesting  direction  for  developing
the  results  of  this  paper  would  be  to  combine  the  dual
version  of  Ridge  Regression  with  the  ideas  of  Gammerman
et  al.  [3]  to  obtain  a  measure  of  confidence
for  predictions  output  by  our  algorithms.  We  expect
that  in  this  case  simple  closed-form  formulas  can  be
obtained.

Abstract

Production  scheduling,  the  problem  of  se¬ 
quentially  configuring  a  factory  to  meet 
forecasted  demands,  is  a  critical  problem 
throughout  the  manufacturing  industry.  The 
requirement  of  maintaining  product  inven¬ 
tories  in  the  face  of  unpredictable  demand 
and  stochastic  factory  output  makes  stan¬ 
dard  scheduling  models,  such  as  job-shop, 
inadequate.  Currently  applied  algorithms, 
such  as  simulated  annealing  and  constraint 
propagation,  must  employ  ad-hoc  methods 
such  as  frequent  replanning  to  cope  with  un¬ 
certainty. 

In  this  paper,  we  describe  a  Markov  Deci¬ 
sion  Process  (MDP)  formulation  of  produc¬ 
tion  scheduling  which  captures  stochasticity 
in  both  production  and  demands.  The  solu¬ 
tion  to  this  MDP  is  a  value  function  which 
can  be  used  to  generate  optimal  scheduling 
decisions  online.  A  simple  example  illustrates 
the  theoretical  superiority  of  this  approach 
over  replanning-based  methods.  We  then  de¬ 
scribe  an  industrial  application  and  two  rein¬ 
forcement  learning  methods  for  generating  an 
approximate  value  function  on  this  domain. 

Our  results  demonstrate  that  in  both  deter¬ 
ministic  and  noisy  scenarios,  value  function 
approximation  is  an  effective  technique. 

1  Introduction 

Production  scheduling  is  a  critical  problem  through¬ 
out  the  manufacturing  industry.  In  this  paper,  we  ar¬ 
gue  that  in  order  to  deal  with  uncertainty  in  factory 
production  and  demands,  a  Markov  Decision  Process
(MDP)  formulation  is  superior  to  the  approaches  cur¬ 
rently  in  use.  Our  paper  is  organized  as  follows: 

•  Section  2  describes  the  abstract  task  of  production
scheduling  and  the  sources  of  uncertainty  which 
make  the  task  difficult  for  current  approaches.  It 
also  gives  details  of  the  particular  scheduling  in¬ 
stance  we  have  worked  on  in  collaboration  with  a 
major  U.S.  food  manufacturer. 

•  Section  3  introduces  the  MDP  model  of  the 
scheduling  task  and  its  solution  based  on  value 
functions.  A  simple  example  illustrates  that  in 
the  presence  of  uncertainty,  the  MDP  model  pro¬ 
duces  the  optimal  solution  where  both  open-loop 
and  closed-loop  planners  do  not.  We  then  discuss 
two  reinforcement  learning  algorithms,  Memory-based
RTDP  and  ROUT,  which  are  applicable  for
solving  large-scale  MDPs  by  value  function  ap¬ 
proximation. 

•  Section  4  presents  experimental  results  with 
ROUT  and  Memory-based  RTDP  on  two  some¬ 
what  simplified  versions  of  the  real-world  man¬ 
ufacturing  task.  The  results  compare  favorably 
to  greedy  and  simulated  annealing  algorithms 
in  both  noisy  and  (surprisingly)  deterministic 
scheduling  scenarios. 

•  Finally,  Section  5  discusses  our  results,  related 
work,  and  promising  future  directions. 

2  Production  Scheduling 

2.1  Problem  Specification 

Production  scheduling  is  the  problem  of  deciding  how 
to  configure  a  factory  sequentially  to  meet  demands. 

Figure  1:  A  demand  curve  for  one  product  (see  text
for  explanation)


Figure  2:  Factory  layout  (see  text  for  explanation) 


We  restrict  our  attention  here  to  a  type  of  produc¬ 
tion  scheduling  called  “make  to  stock.”  We  assume 
we  have  a  modest  number  of  products  (2-100)  and 
must  produce  enough  of  each  to  keep  warehouse  stocks 
high  enough  to  satisfy  customer  requests  for  bulk  ship¬ 
ments.  This  production  model  is  common  for  most 
goods  found  in  a  supermarket.  Automobile  produc¬ 
tion,  by  contrast,  is  typically  not  scheduled  under  this 
model  since  cars  are  assembled  individually  with  dif¬ 
ferent  options  depending  on  specific  customer  orders. 

An  instance  of  the  production  scheduling  problem  is 
composed  of  five  parts: 

Machines  and  products.  This  is  a  list  of  what  ma¬ 
chines  are  present  in  the  factory,  and  what  prod¬ 
ucts  can  be  made  on  the  machines.  There  may 
be  complex  constraints  such  as  “machine  A  can 
only  make  product  1  when  machine  B  is  not  mak¬ 
ing  product  3.”  A  complete,  legal  assignment  of 
products  onto  the  set  of  machines  is  called  a  con¬ 
figuration.  There  is  also  a  special  “closed”  con¬ 
figuration  which  represents  a  decision  to  shut  the 
factory  down. 


Changeover  times.  It  generally  takes  a  certain 
amount  of  time  to  switch  the  factory  from  one 
configuration  to  another.  During  that  time,  there 
is  no  production.  The  problem  definition  includes 
a  (possibly  stochastic)  estimate  of  how  long  it 
takes  to  change  each  configuration  to  each  other 
configuration. 

Production  rates.  Each  configuration  produces  a 
set  of  products  at  a  certain  rate.  There  may  be 
dependencies  between  the  machines.  For  exam¬ 
ple,  machine  B  may  produce  product  2  faster  if 
machine  A  is  also  producing  product  2.  The  ac¬ 
tual  production  rates  in  the  factory  may  be  very 
stochastic;  for  example,  some  machines  may  jam 
frequently,  causing  irregular  delays  on  the  produc¬ 
tion  line. 

Inventory  demand  curves.  At  the  time  a  schedule 
is  created,  a  demand  curve  for  each  product  is 
available  from  a  corporate  marketing  and  fore¬ 
casting  group.  As  shown  in  Fig.  1,  each  curve 
starts  at  the  left  with  the  current  inventory  of 
that  product.  The  inventory  decreases  over  time 
as  future  product  shipments  are  made  and  eventu¬ 
ally  goes  below  zero  if  no  new  production  occurs. 
To  avoid  penalties,  the  scheduler  should  call  for 
more  production  before  the  demand  curve  falls  be¬ 
low  zero.  These  curves  may  also  change  over  time 
as  new  information  about  future  product  demand
becomes  available. 

Schedule  costs.  Running  a  schedule  generates  a  dol¬ 
lar  measure  of  net  profit  or  loss.  This  includes  the 
costs  of  running  the  factory,  paying  the  workers, 
purchasing  the  raw  materials,  and  carrying  inven¬ 
tory  at  the  warehouse,  which  are  all  real  dollar 
costs.  It  also  includes  heuristic  costs  such  as  an 
estimate  of  the  damage  done  by  failing  to  fill  a 
customer  request  when  the  warehouse  inventory 
goes  to  zero.  Finally,  it  includes  the  revenue  gen¬ 
erated  from  selling  product  to  a  customer.  The 
final  cost  (or  profit)  of  a  schedule  is  the  sum  of 
all  these  real  dollar  costs,  heuristic  penalties,  and 
revenue. 

Given  this  problem  description,  the  task  of  production 
scheduling  is  to  maximize  expected  profit  by  selecting 
factory  configurations  over  a  period  of  time.  In  cases 
where  the  production  rates  and  demand  curves  are  as¬ 
sumed  deterministic,  the  problem  reduces  to  finding 
the  optimal  open-loop  schedule:  that  is,  find  a  fixed 
sequence  of  configurations  that  maximizes  profit.  In 
the  general  stochastic  case,  the  optimal  choice  of  configuration
at  time  t  will  depend  on  the  outcomes  of
earlier  configurations,  so  the  optimal  solution  has  the 
form  of  a  closed-loop  scheduling  policy. 

2.2  A  Real  Production  Scheduling  Problem 

We  have  devoted  considerable  effort  to  optimizing  the 
production  scheduling  of  a  particular  U.S.  factory.  The 
physical  layout  of  one  production  line  in  the  factory  is 
shown  in  Fig.  2.  Raw  materials  enter  the  factory  and 
are  processed  using  a  (proprietary)  system  that  creates 
up  to  twelve  output  streams  of  finished  products  si¬ 
multaneously.  Depending  on  how  numerous  machines 
and  links  between  machines  are  configured,  the  rate 
of  production  of  each  of  the  twelve  kinds  of  products 
varies.  Production  costs  (caused  by  fuel  uses,  person¬ 
nel  costs,  and  wasted  material)  also  vary  according  to 
the  factory  configuration. 

Taking  into  account  all  the  constraints  between  ma¬ 
chines  in  the  factory,  there  are  about  100,000  different 
possible  configurations.  Factories  of  this  type  typically 
produce  on  the  order  of  $50  million  to  $2  billion  worth 
of  product  annually,  so  the  opportunities  for  cost  sav¬ 
ings  via  improved  scheduling  are  large. 

2.3  Conventional  Solution  Methods 

Production  scheduling  is  difficult  to  model  within  the 
standard  job-shop  scheduling  paradigm.  In  job-shop 
scheduling,  the  problem  is  to  complete  a  batch  of 
atomic  jobs  under  ordering  constraints  and  constraints 
on  which  machines  can  handle  which  jobs,  and  at 
what  speeds  and  costs.  This  model  cannot  readily  be 
adapted  to  handle  production  rate  interdependencies 
among  machines,  the  desire  to  keep  inventory  levels 
above  zero  at  all  times  (rather  than  just  completing 
jobs  by  their  deadlines),  and  stochasticity  of  demand 
forecasts  and  production. 

Constraint  propagation  methods  (e.g.  [Zweben  and 
Fox,  1994])  are  commonly  used  to  solve  industrial 
problems.  They  operate  by  efficiently  managing  con¬ 
straints  on  production  deadlines  and  machine  capabil¬ 
ities.  Solution  methods  tend  to  search  by  iteratively 
fixing  violated  constraints,  applying  heuristics  to  guide 
the  fixes.  Constraint  propagation  focuses  primarily  on 
generating  feasible  schedules,  and  only  secondarily  on 
cost  optimality.  This  is  appropriate  when  feasibility  is 
difficult,  but  not  as  good  in  “make  to  stock”  scenarios
where  feasibility  is  easy  and  cost  reduction  is  the  main 
goal.  Constraint  propagation  will  not  receive  further 
consideration  here  for  that  reason. 


When  cost  optimality  is  the  primary  scheduling  objec¬ 
tive,  global  optimization  techniques  such  as  simulated 
annealing  (SA)  are  a  good  option.  These  methods 
search  a  space  of  fully-instantiated  schedules  to  find 
the  best  ones.  However,  neither  constraint  propaga¬ 
tion  nor  simulated  annealing  is  naturally  formulated 
to  handle  stochastic  problems.  They  can  be  modified 
for  nondeterminism  in  two  ways:

•  Optimization  open-loop:  Search  for  the  fixed 
schedule  s  which  maximizes  the  average  profit 
over  several  independent  stochastic  simulations  of 
s.  Here,  all  the  computation  is  spent  at  the  be¬ 
ginning,  and  the  resulting  best  schedule  is  exe¬ 
cuted  without  observing  actual  production  statis¬ 
tics  along  the  way.  This  algorithm  suffers  because 
it  cannot  update  the  schedule  to  account  for  vari¬ 
ances  in  actual  production.  To  compensate  for 
this  inadequacy,  “replanning”  methods  are  usu¬ 
ally  adopted. 

•  Replanning  closed-loop:  When  possible,  this 
method  starts  with  the  open-loop  stochastic  eval¬ 
uation  from  the  previous  option.  For  feasibility- 
based  methods  it  must  start  with  a  deterministic 
version  of  the  problem.  In  either  case,  it  uses  its 
first  schedule  only  to  make  some  initial  scheduling 
decisions.  Then,  whenever  the  result  of  an  action 
with  a  stochastic  outcome  is  observed,  it  replans 
the  remainder  of  the  schedule  in  order  to  make 
new  decisions. 

The  closed-loop  method  can  produce  good  results. 
However,  it  is  computationally  quite  expensive.  More¬ 
over,  although  it  replans  on  every  step,  its  policy  does 
not  take  advantage  of  the  fact  that  it  will  be  able  to 
replan  in  the  future — and  as  we  show  in  Section  3.2 
below,  this  dooms  it  to  being  unable  to  attain  the  op¬ 
timal  profit,  no  matter  how  much  computation  time  it 
is  allowed. 

3  Production  Scheduling  with  Value 
Functions 

This  section  describes  a  principled  approach  to  gener¬ 
ating  closed-loop  production  scheduling  policies  with 
reinforcement  learning  methods.  The  approach  is
based  on  representing  the  problem  as  an  MDP  and
representing  the  solution  as  an  approximate  value
function.

3.1 Production Scheduling as an MDP

Abstractly, a Markov Decision Process (MDP) is
defined by a state space X, action set A, immediate
reward function R(x,a), and probabilistic transition
model P(x'|x,a). The solution to the MDP is a policy
π* : X → A which, if followed by the agent, will
maximize the expected long-term sum of rewards
attainable starting from any state x. Dynamic programming
methods tabulate this optimal cumulative reward in
the optimal value function V*(x), which is the unique
solution to the Bellman equations [Bellman, 1957]:

$$V^*(x) = \max_{a \in A} \Big[ R(x,a) + \sum_{x' \in X} P(x' \mid x, a)\, V^*(x') \Big] \qquad (1)$$

Once V* is computed, the optimal policy π* is
immediately obtained by choosing any action which
instantiates the max in Eq. 1.
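The one-step backup and the policy extraction it implies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy states, actions, rewards, and the function name `backup` are hypothetical and do not come from the scheduling problem; the value table `V` is assumed to be already known for the successor states.

```python
def backup(x, actions, R, P, V):
    """One-step Bellman backup at state x: returns (value, greedy action)."""
    best_value, best_action = float("-inf"), None
    for a in actions:
        # Immediate reward plus the expected value of the successor state.
        q = R[(x, a)] + sum(p * V[x2] for x2, p in P[(x, a)].items())
        if q > best_value:
            best_value, best_action = q, a
    return best_value, best_action


# Hypothetical toy MDP: from "start", the action "safe" pays 1 and ends;
# "risky" pays 0 and reaches a state worth 4 with probability 0.5.
V = {"end": 0.0, "good": 4.0, "bad": 0.0}
R = {("start", "safe"): 1.0, ("start", "risky"): 0.0}
P = {
    ("start", "safe"): {"end": 1.0},
    ("start", "risky"): {"good": 0.5, "bad": 0.5},
}

value, action = backup("start", ["safe", "risky"], R, P, V)
print(value, action)  # 2.0 risky
```

The action returned alongside the value is exactly the one that "instantiates the max", so the greedy policy falls out of the same computation.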

The  production  scheduling  problem  is  modeled  very 
naturally  as  a  Markov  Decision  Process,  as  follows: 

• The system state is defined by the current time
t ∈ 0...T; the current inventory of each product
p_1 ... p_N; and, if there are configuration-dependent
changeover times, the current factory
configuration.

• The action set consists of all legal factory
configurations. We assume a discrete-time model, so the
configuration chosen at time t will run unchanged
until time t + 1.

•  The  stochastic  transition  function  applies  a  simu¬ 
lation  of  the  factory  to  compute  the  change  in  all 
inventory  levels  realized  by  running  configuration 
c_t for one timestep. This model handles random
variations  in  production  rates  straightforwardly; 
it  also  handles  changeover  times  by  simply  de¬ 
creasing  production  in  proportion  to  the  (possibly 
stochastic)  downtime.  The  time  t  is  incremented 
on  each  step,  and  the  process  terminates  when 
t  =  T. 

• The immediate reward function is computed from
the inventory levels, based on the demand curve
at time t. It incorporates the revenues from
production, penalties from late production, employee
costs, operating costs, raw material costs, and
changeover costs incurred during the period. On
the final time period (transition from t = T-1 to
T), a terminal “reward” assigns additional
penalties for any outstanding unsatisfied demands.
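The state and transition components listed above can be sketched as a small data structure plus a sampling function. Everything specific here is an assumption made for illustration: the two-product factory, the `MEAN_OUTPUT` table, the Gaussian noise model, and the names `State`, `step`, and `is_terminal` are hypothetical, and the reward function and changeover downtime are deliberately left out.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class State:
    t: int                         # current time step, 0..T
    inventory: Tuple[float, ...]   # units on hand for each product
    config: int                    # current factory configuration


# Hypothetical data: mean units of each product made per step by each configuration.
MEAN_OUTPUT = {0: (0.0, 0.0), 1: (5.0, 0.0), 2: (0.0, 4.0)}
T = 8  # horizon, in time steps


def step(state, action, rng, noise=0.1):
    """Sample the successor state after running configuration `action` for one step."""
    produced = [max(0.0, rng.gauss(m, noise * m)) for m in MEAN_OUTPUT[action]]
    inventory = tuple(i + p for i, p in zip(state.inventory, produced))
    return State(state.t + 1, inventory, action)


def is_terminal(state):
    """The process terminates when t reaches T."""
    return state.t >= T


rng = random.Random(0)
s = State(t=0, inventory=(0.0, 0.0), config=0)
while not is_terminal(s):
    s = step(s, action=1, rng=rng)
print(s.t, s.config)  # 8 1
```

Because the time counter is part of the state and only ever increases, no state can be revisited, which is the acyclic property that Section 3.3.2 relies on.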


The  MDP  representation  suits  this  problem  very  well, 
for  two  main  reasons.  First,  in  contrast  to  other  tra¬ 
jectory  optimization  tasks  (e.g.,  the  Travelling  Sales¬ 
man  Problem),  the  utility  of  future  decisions  does 
not  depend  on  the  entire  sequence  of  previous  action 
choices  and  outcomes,  but  only  on  a  relatively  compact 
state  description — the  current  time  and  inventory  lev¬ 
els.  Simulated  annealing  and  other  global  optimization 
methods  do  not  require  this  Markov  property — nor  can 
they  exploit  it.  Second,  the  model  fully  represents  un¬ 
certainty  in  production  rates  and  changeover  times. 
As  defined  here,  the  model  also  handles  noise  in  the 
demands  if  that  noise  is  time-independent,  but  it  can¬ 
not  account  for  the  possibility  of  the  demand  curves 
being  randomly  updated  in  the  middle  of  a  schedule, 
since  that  would  make  the  MDP  transition  probabili¬ 
ties  nonstationary. 

The  value  function  for  this  MDP  specifies  a  closed- 
loop  scheduling  policy  which  makes  optimal  decisions 
with  full  foresight  of  the  remaining  uncertainty  in  the 
process.  No  method  based  on  global  optimization  can 
make  this  claim,  even  if  replanning  is  allowed,  as  we 
now  illustrate. 


3.2  Illustrative  Example 

This  example  illustrates  how  MDP  solutions  optimally 
solve  sequential  decision  problems  that  methods  based 
on  replanning  cannot.  Suppose  we  are  asked  to  sched¬ 
ule  the  production  of  12  units  of  a  single  product  over 
two  days.  On  each  day  we  can  choose  one  of  the  fol¬ 
lowing  three  configurations: 


Configuration   With probability   Gets production   Cost
1               0.5                3                 $1
                0.5                6
2               1.0                6                 $4
3               1.0                9                 $8

In addition to the per-configuration costs listed in the
table, there is an additional cost of $8 for each unit
short of the 12 required at the end of two days. The
following table shows the expected cost of each of the
possible schedules. (Note that in this example, the
expected cost of a schedule [a b] is the same when the
sequence is reversed, [b a], so redundant schedules are
omitted from the table.)

Config     Config   Expected Missed         Total
Sequence   Cost     Production Cost         Cost
1 1        $2       0.25*$48 + 0.5*$24      $26
1 2        $5       0.5*$24                 $17
1 3        $9       $0                      $9
2 2        $8       $0                      $8
2 3        $12      $0                      $12
3 3        $16      $0                      $16

Based  on  these  costs,  a  replanning-based  scheduler  will 
choose sequence [2 2]. It will execute configuration 2
on  the  first  day,  and  then  have  an  opportunity  to  re¬ 
plan  for  day  2  based  on  the  results  of  day  1.  Since 
the  production  from  configuration  2  on  day  1  is  de¬ 
terministic  (6  units),  the  scheduler  will  again  choose 
configuration  2  on  day  2,  thereby  completing  the  2-day 
production  run  with  a  total  cost  of  $8. 
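The expected costs in the table can be reproduced by enumerating the outcomes of each fixed schedule. The sketch below is a minimal Python check using only the numbers given in this example; the names `CONFIGS` and `open_loop_cost` are illustrative.

```python
from itertools import combinations_with_replacement, product

# Configuration -> (list of (probability, units produced), cost in dollars)
CONFIGS = {
    1: ([(0.5, 3), (0.5, 6)], 1),
    2: ([(1.0, 6)], 4),
    3: ([(1.0, 9)], 8),
}
TARGET, PENALTY = 12, 8  # units required; dollars per missing unit


def open_loop_cost(schedule):
    """Expected total cost of running a fixed schedule with no replanning."""
    config_cost = sum(CONFIGS[c][1] for c in schedule)
    missed = 0.0
    # Enumerate every joint outcome of the scheduled configurations.
    for outcomes in product(*(CONFIGS[c][0] for c in schedule)):
        prob, made = 1.0, 0
        for p, units in outcomes:
            prob *= p
            made += units
        missed += prob * PENALTY * max(0, TARGET - made)
    return config_cost + missed


costs = {s: open_loop_cost(s) for s in combinations_with_replacement([1, 2, 3], 2)}
best = min(costs, key=costs.get)
print(best, costs[best])  # (2, 2) 8.0
```

A scheduler that ranks whole schedules this way picks [2 2], because every schedule that starts with configuration 1 is charged for the risk of a shortfall that a later replan could in fact avoid.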

The  replanning-based  scheduler  makes  a  suboptimal 
decision  on  day  1  because  it  doesn’t  “know”  that  it 
will  be  given  the  chance  to  replan  after  the  first  day’s 
production  is  observed.  By  contrast,  with  the  ability 
to  exploit  this  knowledge,  the  MDP  solution  makes  the 
correct  scheduling  decision  of  action  1  on  day  1.  The 
following  table  evaluates  the  choices  for  day  1  by  show¬ 
ing  all  the  possible  outcomes  followed  by  the  optimal 
day  2  choice  for  each  outcome. 


day 1    with   units   day 2    with   units   expected
config   prob   made    config   prob   made    cost
1        0.5    3       3        1.0    9       $7
         0.5    6       2        1.0    6
2        1.0    6       2        1.0    6       $8
3        1.0    9       1        0.5    3       $9
                                 0.5    6

By considering all the possible outcomes and the
optimal decisions that will be made for each one, the MDP
solution chooses configuration 1 on the first day and
achieves an expected cost of $7, as compared to the $8
obtained by replanning. This type of tradeoff exists in
real factories as well. There is often a choice of how
fast to run the production line that trades off higher
production rates against higher unit costs.
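The expected cost of $7 can be checked by backward induction over the two days, which is the Bellman backup of Eq. 1 written for costs instead of rewards. The sketch below uses only the numbers from this example; the function name `cost_to_go` is illustrative.

```python
# Configuration -> (list of (probability, units produced), cost in dollars)
CONFIGS = {
    1: ([(0.5, 3), (0.5, 6)], 1),
    2: ([(1.0, 6)], 4),
    3: ([(1.0, 9)], 8),
}
TARGET, PENALTY, DAYS = 12, 8, 2


def cost_to_go(day, made):
    """Minimum expected remaining cost from (day, units made so far),
    together with the configuration that achieves it."""
    if day == DAYS:
        # Terminal penalty for every unit short of the target.
        return PENALTY * max(0, TARGET - made), None
    best = (float("inf"), None)
    for c, (outcomes, cost) in CONFIGS.items():
        expected = cost + sum(p * cost_to_go(day + 1, made + u)[0] for p, u in outcomes)
        if expected < best[0]:
            best = (expected, c)
    return best


print(cost_to_go(0, 0))  # (7.0, 1)
print(cost_to_go(1, 3))  # (8.0, 3)
print(cost_to_go(1, 6))  # (4.0, 2)
```

The day-1 evaluation of configuration 1 is 1 + 0.5 × 8 + 0.5 × 4 = 7: the cheap, risky configuration is worth trying first precisely because the day-2 decision will adapt to its outcome.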

3.3  Value  Function  Approximation 

In practical scheduling problems, tabulating V*(x)
for  every  possible  state  of  the  factory  is  completely 
intractable.  Instead,  we  use  reinforcement  learning 
methods  to  represent  V*  compactly  with  a  function 
approximator,  such  as  global  or  local  polynomial  re¬ 
gression.  The  two  methods  we  tested  are  Memory- 
based  RTDP  and  ROUT. 


3.3.1 Memory-Based RTDP

Memory-based RTDP is a reinforcement learning approach
that is closely related to RTDP (Real-Time
Dynamic  Programming)  [Barto  et  al.,  1995]  and 
to  Tesauro’s  application  of  TD(0)  to  the  game  of 
backgammon  [Sutton,  1988,  Tesauro,  1992].  It  is  also 
similar  to  the  instance-based  approach  to  represent¬ 
ing  value  functions  used  in  [Peng,  1993].  Trajectories 
through  the  MDP  model  are  generated  repeatedly,  us¬ 
ing  the  current  approximation  of  the  value  function  to 
guide  standard  Boltzmann-style  exploration  [Barto  et 
al.,  1995].  At  each  step  of  each  trajectory,  a  one-step 
backup  operation  (Eq.  1)  is  performed  and  the  func¬ 
tion  approximator  is  updated. 
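Boltzmann-style exploration is commonly implemented as a softmax over backed-up action values. The following Python sketch shows that common form; it is an assumption that the experiments used exactly this variant, and the names `boltzmann_probs` and `boltzmann_choice` are illustrative.

```python
import math
import random


def boltzmann_probs(values, temperature):
    """Softmax over action values; lower temperature means greedier choices."""
    m = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - m) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_choice(values, temperature, rng):
    """Sample an action index according to the Boltzmann distribution."""
    probs = boltzmann_probs(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


rng = random.Random(0)
q_values = [1.0, 2.0, 3.0]  # hypothetical one-step backed-up values per action
print(boltzmann_probs(q_values, temperature=1.0))
print(boltzmann_choice(q_values, temperature=0.01, rng=rng))
```

At a high temperature the actions are sampled almost uniformly, and as the temperature approaches zero the choice becomes greedy with respect to the current value estimates.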

In Memory-based RTDP, the value function is
represented by a nonparametric memory-based function
approximator [Cleveland and Devlin, 1988, Moore et al.,
1995, Atkeson et al., 1995]. Memory-based learning
simply  accumulates  training  data  points,  rather  than 
running  a  training  algorithm  on  them.  Then  whenever 
a  query  is  made,  the  approximator’s  output  is  com¬ 
puted  by  a  weighted  average  or  weighted  polynomial 
regression  over  nearby  points  in  memory. 
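A minimal Python sketch of the weighted-average variant is given below. The Gaussian kernel, the class name `MemoryBasedApproximator`, and the sample points are assumptions made for illustration; the experiments also used KD-trees for efficiency, which this sketch omits.

```python
import math


class MemoryBasedApproximator:
    """Kernel regression: store every training point, average at query time."""

    def __init__(self, kernel_width):
        self.kernel_width = kernel_width
        self.points = []  # list of (state vector, value) pairs

    def add(self, x, value):
        # "Training" is just remembering the data point.
        self.points.append((tuple(x), value))

    def predict(self, query):
        num = den = 0.0
        for x, value in self.points:
            d2 = sum((a - b) ** 2 for a, b in zip(x, query))
            w = math.exp(-d2 / (2 * self.kernel_width ** 2))  # Gaussian weight
            num += w * value
            den += w
        # With no usable neighbours, fall back to 0.0 (an arbitrary default).
        return num / den if den > 0 else 0.0


approx = MemoryBasedApproximator(kernel_width=0.5)
approx.add((0.0,), 0.0)
approx.add((1.0,), 10.0)
print(approx.predict((0.5,)))  # midway between the two stored points
```

The kernel width controls how far a stored point's influence reaches, which is why it has to be tuned together with the exploration temperature.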

Achieving  good  performance  with  Memory-based 
RTDP  requires  an  appropriate  choice  of  the  Boltz¬ 
mann  exploration  temperature  and  the  local  regression 
kernel  width.  These  values  were  tuned  empirically  to 
obtain  the  results  presented  in  Section  4.  Although  the 
training  points  generated  by  Memory-based  RTDP’s 
early  trajectories  are  undoubtedly  inaccurate  samples 
of  V*,  we  did  not  find  it  necessary  to  include  an  ex¬ 
plicit  “forgetting”  mechanism  in  the  learning;  the  bad 
points  are  quickly  overwhelmed  by  later,  more  accu¬ 
rate  samples. 

3.3.2  ROUT 

ROUT  is  an  active  learning  algorithm  for  value  func¬ 
tion  approximation  that  is  specifically  designed  for  the 
subclass  of  acyclic  MDPs  [Boyan  and  Moore,  1996]. 
Note  that  the  scheduling  MDP  is  certainly  acyclic, 
since  its  state  representation  includes  the  time  counter 
t.  Using  simulations  of  the  process,  ROUT  repeatedly 
identifies  a  new  state  x  at  which  (1)  the  function  ap¬ 
proximator  is  currently  in  error,  and  (2)  an  accurate 
sample  of  V*  can  be  obtained  from  a  1-step  backup. 
Unlike Memory-based RTDP and most other reinforcement
learning methods, ROUT explicitly tries to
prevent the function approximator from seeing any
inaccurate samples of V*.

Details of how ROUT identifies such states automatically
are given in [Boyan and Moore, 1996]. One by
one, these useful states are accumulated into a
training set of accurate samples of V*(x). The training set
grows backwards from the terminal states. As soon as
the start state x_0 is itself added to the training set,
ROUT declares victory, outputs its learned training
set and learned approximation of V*, and terminates.

If  the  function  approximator  cannot  represent  V*  ac¬ 
curately,  then  ROUT  may  become  stuck,  repeatedly 
adding points near the terminal states and never
progressing backwards. However, if the function
approximator can represent V* to within the specified
tolerance, then ROUT can be guaranteed to eventually
find it. For ROUT to find V* efficiently, the
function approximator must extrapolate well from a small
training  set. 

4  Experimental  Results 

We  have  experimented  with  two  instances  of  the  real- 
world  production  scheduling  task  described  in  Sec¬ 
tion  2.2.  The  first  instance  is  heavily  simplified  so 
that  the  exact  optimal  closed-loop  scheduling  policy 
can  be  calculated  tractably.  The  second  instance  is  a 
more  realistic  model,  for  which  only  heuristic  solutions 
are  available. 

4.1  Simplified  Scheduling  Instance 

In the simplified instance, the task is to schedule 8
weeks of production; however, configurations may be
changed only at 2-week intervals, and only 17
configuration choices are available. Of these 17, nine have
deterministic production rates; the other eight each
have two stochastic outcomes, producing only 1/3 of
their usual amount with probability 0.5. With a
total of 9 × 1 + 8 × 2 = 25 outcomes possible from
every state, there are 25^4 = 390,625 possible
trajectories through the space. The optimal policy can be
computed by tabulating V*(x) at every possible
intermediate state x of the factory, of which there are
1 + 25 + 25^2 + 25^3 = 16,276. The optimal policy
results in an expected cumulative reward of -$22.8M.
By contrast, a random schedule attains a reward of
-$923M on average! A greedy policy, which at each
step selects a configuration to maximize only the
one-step reward from the current state, attains -$97.9M.
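The counts quoted above follow from simple arithmetic, which the short Python check below reproduces. The variable names are illustrative.

```python
deterministic, stochastic = 9, 8

# Each deterministic configuration has 1 outcome, each stochastic one has 2.
outcomes_per_state = deterministic * 1 + stochastic * 2

# 8 weeks of production with one decision every 2 weeks.
decision_points = 8 // 2

trajectories = outcomes_per_state ** decision_points
# Non-terminal states: the start state plus every state reached after
# 1, 2, or 3 decisions.
intermediate_states = sum(outcomes_per_state ** k for k in range(decision_points))

print(outcomes_per_state, trajectories, intermediate_states)  # 25 390625 16276
```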

We  applied  ROUT  to  this  instance,  trying  three 
different  function  approximators:  1-nearest  neigh¬ 
bor,  locally-weighted  linear  regression,  and  global 
quadratic regression. KD-trees were used to keep the
computation  efficient  [Moore  et  al.,  1997].  For  the  lo¬ 
cally  weighted  regression,  a  kernel  width  of  2“®  of  the 
range  of  each  input  dimension  in  the  training  data  was 
used. ROUT's exploration and tolerance parameters
were tuned manually. Table 1 summarizes the results.

When  nearest-neighbor  was  used  as  the  function  ap¬ 
proximator,  ROUT  did  not  obtain  sufficient  general¬ 
ization  from  its  training  set  and  failed  to  terminate 
within  a  limit  of  several  hours.  However,  with  both 
local  linear  and  global  quadratic  regression  models, 
ROUT  did  run  to  completion  and  produced  an  approx¬ 
imate  value  function  which  significantly  outperformed 
the  greedy  policy.  Moreover,  over  half  of  the  ROUT 
runs  did  indeed  terminate  with  the  optimal  closed-loop 
scheduling policy. In these cases, ROUT's final
self-built training set for value function approximation
consisted of only about 100-150 training points, a
substantial reduction over the 16,276 required for full
tabulation of V*. ROUT's total running time (≈ 1 hour
on a 200 MHz Pentium Pro) was roughly half of that
required to enumerate V* manually.

From  these  preliminary  results,  we  conclude  that 
ROUT  does  indeed  have  the  potential  to  approximate 
V*  extremely  well,  given  a  suitable  function  approx¬ 
imator  for  the  domain.  However,  since  it  runs  quite 
slowly  on  even  this  simplified  problem,  we  believe 
ROUT  will  not  scale  up  to  practical  scheduling  in¬ 
stances  without  further  refinements. 

4.2 Practical Scheduling Instance

In  this  section  we  present  experimental  results  on  a 
larger  scheduling  problem.  In  doing  so,  we  lose  the 
ability  to  determine  the  optimal  policy  for  compari¬ 
son.  However,  it  gives  a  better  demonstration  of  how 
the  competing  methods  perform  on  industrial-scale 
scheduling  problems.  The  task  is  to  schedule  eight 
weeks of production at one-week intervals. There are
eight  products,  eight  machines,  and  a  total  of  421  le¬ 
gal  configurations  to  consider,  including  the  “closed” 
configuration. 

Our  experiments  consider  both  deterministic  and  noisy 
versions  of  the  problem.  To  build  the  deterministic 
version  of  the  problem,  we  ran  long  (stochastic)  sim¬ 
ulations  for  each  of  the  421  actions  and  cached  the 
mean  observed  production  rate  for  each.  For  the  noisy 
versions,  we  could  have  used  the  noisy  outcomes  di¬ 
rectly  from  the  stochastic  simulation,  but  instead  we 
simply  added  Gaussian  noise  to  the  cached,  determin¬ 
istic  production  rates.  This  enabled  our  experiments 
to run significantly faster, and also allowed us to
easily generate empirical results with varying amounts of
noise.

Algorithm                 Mean Profit ($M)   95% C.I.   optimal runs
Optimal                   -22.8                         1
Random                    -923.2             ±58.7      0
Greedy                    -97.9              ±15.1      0
ROUT + global quadratic   -57.0              ±23.5      10/16
ROUT + local linear       -45.0              ±16.9      10/16

Table 1: Results for 4-timestep, 17-configuration stochastic scheduling problem.

Table  2  shows  experimental  results.  The  computation 
times  reported  are  on  a  200  MHz  Pentium  Pro.  The 
first  section  contains  results  for  the  case  where  the  fac¬ 
tory  output  is  deterministic  and  known.  The  purpose 
of  the  first  two  lines  is  to  delimit  the  range  of  results  we 
should  expect  from  good  algorithms.  The  “Random” 
algorithm  builds  a  schedule  by  choosing  8  configura¬ 
tions  at  random,  and  it  loses  an  enormous  amount  of 
money.  Much  of  the  cost  is  due  to  heuristic  penalties 
for  failing  to  satisfy  customer  demand. 

The  “Planit”  algorithm,  developed  by  Schenley  Park 
Research,  is  the  proprietary  algorithm  currently  used 
to  schedule  the  real  factory’s  production.  It  has  sev¬ 
eral  advantages  over  the  other  algorithms  in  this  table. 
First,  it  is  finely  tuned  to  schedule  this  factory  using 
a  combination  of  simulated  annealing,  linear  program¬ 
ming,  constraint  propagation,  and  several  heuristics. 
Second,  it  is  not  restricted  to  choosing  configurations 
for  pre-discretized  time  steps,  but  can  choose  an  ar¬ 
bitrary  number  of  configurations  and  switch  between 
them  at  arbitrary  times.  Our  experience  with  this 
scheduler  leads  us  to  believe  that  the  average  profit 
of  $13.81M  is  very  near  optimal  for  this  instance,  so  it 
can  be  considered  an  unattainable  upper  bound  for  the 
other  results.  In  particular,  Planit  achieves  its  results 
by  using  an  average  of  around  13  configurations  in  its 
schedules  while  the  other  algorithms  are  restricted  to 
8  fixed-sized  time  steps.  It  usually  incurs  no  heuristic 
penalties  in  its  schedules,  so  that  figure  is  a  profit  in 
real  dollars. 

The  simulated-annealing,  greedy-exploration,  and 
Memory-based  RTDP  algorithms  are  run  as  described 
in  the  previous  sections.  The  simulated  annealing  runs 
made  use  of  the  successful  “modified  Lam”  adaptive 
annealing  schedule  [Ochotta,  1994].  Memory-based 
RTDP  used  kernel  regression  with  a  kernel  width  of 
2~®  of  the  range  of  each  state  variable,  and  used  KD- 
trees  for  efficiency  [Moore  et  al.,  1997].  Boltzmann 
exploration (without cooling) was used for the
deterministic case, but proved unnecessary in the stochastic
case because the noise alone caused sufficient
exploration.

The  poor  result  from  Greedy  in  the  deterministic  case 
shows  that  generating  trajectories  based  solely  on  the 
one-step  cost  of  configurations  is  not  an  effective  way 
to  search,  even  when  compared  to  a  randomized  search 
method  such  as  simulated  annealing.  The  search  effi¬ 
ciency  gained  by  computing  a  value  function  is  shown 
by  the  favorable  Memory-based  RTDP  results.  They 
are  obtained  from  only  200  trajectories  through  the 
state  space,  meaning  the  value  function  at  each  time 
step  is  represented  with  200  training  points.  All  of  the 
algorithms  can  do  better  with  more  computation  time, 
but  they  were  cut  off  at  10  minutes  since  Planit  gets 
its  results  in  that  much  time. 

The  second  and  third  sections  of  the  table  show  results 
with  10%  and  20%  noise  added.  The  Planit  algorithm 
cannot  be  run  in  these  cases  since  it  does  not  handle 
stochastic  outcomes;  however,  we  still  expect  its  re¬ 
sult  in  the  deterministic  case  to  be  a  reasonable  upper 
bound  for  the  other  algorithms. 

Open-loop  simulated  annealing  means  that  all  the 
computation  is  spent  at  the  beginning  and  the  result¬ 
ing  best  schedule  is  executed  without  observing  ac¬ 
tual  production  statistics  along  the  way.  This  algo¬ 
rithm  suffers  because  it  cannot  update  the  schedule  to 
account  for  variances  in  actual  production.  By  con¬ 
trast,  closed-loop  simulated  annealing  replans  the  rest 
of  the  schedule  after  each  week  of  actual  production 
is  observed.  In  order  to  keep  the  total  computation 
the  same,  the  computation  allotted  for  each  week’s 
decision  was  divided  by  the  number  of  weeks  (8). 
The  results  show  that  replanning  does  improve  over 
open  loop  execution.  We  note  that  all  the  simulated- 
annealing  schedulers  have  high  variance,  which  can  be 
a  disadvantage  of  using  that  algorithm. 

Memory-based RTDP uses its computation at the
beginning to compute a value function. Each run used
400 trajectories for these results. The value function
determines a closed-loop policy valid for any state
reached during actual production. As discussed
earlier, it not only executes closed-loop, but also makes
its decisions “knowing” that it will be executing
closed-loop. The results show both a favorable expected
profit and smaller variance across runs.

Noise level              Algorithm                    Mean Profit   95% C.I.
Deterministic            Random                       -466.35       ±59.45
(≈ 10 min computation)   Planit                         13.81       ±0.08
                         Simulated Annealing             5.66       ±3.68
                         Greedy + Exploration           -1.93       ±3.21
                         Memory-based RTDP               7.70       ±1.57
10% Noise                Greedy (c.l.)                 -17.69       ±1.94
(≈ 45 min computation)   Simulated Annealing (o.l.)      6.48       ±1.21
                         Simulated Annealing (c.l.)      9.03       ±1.04
                         Memory-based RTDP              10.16       ±0.84
20% Noise                Greedy (c.l.)                 -25.92       ±1.12
(≈ 45 min computation)   Simulated Annealing (o.l.)      2.55       ±1.91
                         Simulated Annealing (c.l.)      2.40       ±3.95
                         Memory-based RTDP               7.02       ±0.67

Table 2: Results for 8-timestep, 421-configuration scheduling problem. The numbers shown represent profits in
millions of dollars. On the noisy problems, Memory-based RTDP is statistically better than the other algorithms
at the 95% significance level.

5  Discussion  and  Future  Work 

We  expect  Memory-based  RTDP  to  outperform  sim¬ 
ulated  annealing  on  a  stochastic  problem  based  on 
the  intuition  from  Sec.  3.2,  and  our  experimental  re¬ 
sults  show  that  it  does.  It  is  interesting  to  observe 
that  Memory-based  RTDP  does  well  against  simu¬ 
lated  annealing  even  in  the  deterministic  case  where 
the  stochastic  modeling  capability  of  MDPs  is  not 
needed.  This  provides  further  evidence  that  search 
based  on  value  functions  can  improve  efficiency.  While 
simulated  annealing  is  forced  to  try  configurations  at 
random,  value  function  based  methods  can  explicitly 
reason  about  which  intermediate  states  are  good  and 
which  actions  will  reach  those  states. 

To our knowledge, this work represents the first
application of reinforcement learning to production
scheduling with multiple products made on multiple machines.
The scheduling of machine maintenance is discussed
in [Mahadevan et al., 1997], and transfer line
production scheduling is discussed in [Mahadevan and
Theocharous, 1998]. In their task, each product or
sub-product is produced on a single machine and each
machine makes a local decision on whether to produce
one of its products or go down for maintenance. A
reinforcement learning approach to the Space Shuttle
scheduling problem is described by [Zhang and
Dietterich, 1995]. In that framework, states are complete
schedules and actions are modification operators
applied to the schedules. Their feature representation
introduces noise, but the underlying problem is
deterministic.

Our  empirical  work  to  date  covers  stochasticity  only  in 
production.  Another  large  source  of  uncertainty  in  real 
problems is the inadequacy of demand forecasts. This
can  be  handled  heuristically  within  the  MDP  formu¬ 
lation  described  here  by  the  addition  of  appropriate 
noise  to  the  demands  during  simulations.  However, 
it  may  also  be  possible  to  gain  extra  efficiencies  by 
incorporating  demands  explicitly  into  the  MDP  state 
space.  Further  empirical  work  is  required  to  answer 
that  question. 

As  the  size  of  the  scheduling  problem  increases,  it 
becomes  increasingly  expensive  to  compute  the  value 
function  accurately.  However,  even  an  inexact  value 
function  can  be  useful  as  the  basis  for  a  quasi-greedy 
search  or  “rollout”  search  performed  online  [Tesauro 
and  Galperin,  1997].  We  intend  to  test  such  methods 
in  future  work  on  larger  scheduling  problems. 

Abstract

Stochastic  topological  models,  and  hidden 
Markov  models  in  particular,  are  a  useful  tool 
for  robotic  navigation  and  planning.  In  previ¬ 
ous  work  we  have  shown  how  weak  odometric 
data  can  be  used  to  improve  learning  topologi¬ 
cal  models,  overcoming  the  common  problems 
of the standard Baum-Welch algorithm. Odometric
data typically contain directional information,
which  imposes  two  difficulties:  First,  the  cyclic¬ 
ity  of  the  data  requires  the  use  of  special  circular 
distributions.  Second,  small  errors  in  the  head¬ 
ing  of  the  robot  result  in  large  displacements  in 
the  odometric  readings  it  maintains.  The  cumu¬ 
lative  rotational  error  leads  to  unreliable  odomet¬ 
ric  readings.  In  the  paper  we  present  solutions 
to  these  problems  by  using  a  circular  distribu¬ 
tion  and  relative  coordinate  systems.  We  validate 
their  effectiveness  through  experimental  results 
from  a  model-learning  application. 

1  INTRODUCTION 

Directional  data  is  information  consisting  of  magnitude 
and  direction.  Such  data  is  an  integral  part  of  important  ap¬ 
plications  in  various  areas  of  computer  science  in  general 
and  artificial  intelligence  in  particular.  In  computer  graph¬ 
ics,  automatic  production  of  pen-and-ink  drawings  and  the 
production  of  animation  based  on  magnetic  trackers  data 
requires  statistical  manipulation  of  directional  data.  In  cog¬ 
nitive  science,  modeling  routes  chosen  by  animals  [4]  re¬ 
quires  a  similar  kind  of  statistical  manipulation.  In  the  area 
of  machine  learning  we  often  use  probabilistic  models  for 
robot  movement.  Most  aspects  of  robot  movement  (arm 
movement  as  well  as  the  whole  body  movement)  can  be 
described  in  terms  of  location  and  heading  change,  requir¬ 
ing  the  use  and  manipulation  of  directional  data. 


Probabilistic  models  are  widely  used  within  the  AI  com¬ 
munity.  Such  models  may  allow  continuous  probabilities, 
as  demonstrated  in  work  on  Bayesian  networks  [7],  hid¬ 
den  Markov  models  [5,  8],  probabilistic  clusters  [2]  and 
stochastic  maps  [19],  to  name  a  few.  However,  the  assump¬ 
tion  underlying  all  the  above  work  is  that  continuous  dis¬ 
tributions  are  linear  —  that  is  —  distributions  that  assign 
density  to  each  point  on  the  real  line  so  that  the  area  un¬ 
der  the  density  curve,  integrated  over  the  whole  real  line,  is 
1.*  Such  models  do  not  take  into  account  directional  data, 
which  is  inherently  cyclic.  Under  circular  distributions  the 
density  of  any  point  x  on  the  real  line  is  the  same  as  that  of 
x + kξ, where k is any integer and ξ is some real number.
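The difference between linear and circular treatment of headings can be illustrated with a short Python sketch. It assumes angles in radians with period 2π and uses the von Mises density as the circular distribution; the function names and the series-based Bessel evaluation are illustrative choices, not taken from the paper.

```python
import math


def circular_mean(angles):
    """Mean direction: average the unit vectors, then take the angle."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c) % (2 * math.pi)


def bessel_i0(kappa, terms=50):
    """Modified Bessel function I0 via its power series."""
    total, term = 0.0, 1.0
    for m in range(terms):
        if m > 0:
            term *= (kappa / 2) / m  # term = (kappa/2)^m / m!
        total += term * term
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of the von Mises distribution with mean mu, concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2 * math.pi * bessel_i0(kappa))


headings = [math.radians(350), math.radians(10)]
linear = math.degrees(sum(headings) / len(headings))
circular = math.degrees(circular_mean(headings))
print(linear)    # the arithmetic mean points the opposite way (about 180 degrees)
print(circular)  # the circular mean is at 0 degrees, up to rounding (mod 360)
```

Two headings that straddle north average to south under a linear treatment, whereas the circular mean and the periodic density respect the fact that 350 degrees and 10 degrees are close together.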

The  need  for  circular  distributions  has  long  been  realized 
by  statisticians  [6],  but  the  practice  of  using  them  has  not 
found  its  way  into  the  computer  science  community  and 
to  the  machine  learning  community  in  particular.  One  of 
the  goals  of  this  paper  is  to  point  out  the  usefulness  of  one 
specific  circular  distribution  in  the  context  of  robotics,  and 
provide  a  short  tutorial  on  circular  distributions. 

Another  special  aspect  of  directional  data  is  its  sensitiv¬ 
ity  to  errors.  As  most  navigators,  pilots  and  skippers  have 
experienced,  a  small  angular  deviation  from  the  original 
course  causes  a  big  displacement  at  the  final  location.  This 
problem  is  very  prominent  in  mobile  robots,  where  drifts 
and  drags  of  the  wheels  and  disalignment  of  both  engines 
and  floors  can  cause  a  robot  to  face  in  the  wrong  heading 
with  respect  to  its  own  odometric  readings.  Odometric  in¬ 
formation  is  recorded  by  the  robot  along  three  dimensions; 
it  consists  of  the  changes  along  the  x  and  the  y  axis  as  well 
as  a  change  in  the  heading  of  the  robot  within  a  global  co¬ 
ordinate  system.  In  our  previous  work  on  learning  topolog¬ 
ical  models  [17]  we  made  several  assumptions  about  the 
odometric  data: 

•  All  odometric  measures  are  normally  distributed. 


*Most  often  the  distribution  is  Gaussian. 
• All corridors are perpendicular to each other.

•  The  robot,  when  collecting  the  data,  is  using  the  per¬ 
pendicularity  assumption,  and  is  collecting  the  data 
with  respect  to  one  global  coordinate  system. 

This  paper  demonstrates  the  problematic  aspects  of  these 
assumptions  and  introduces  our  solution  to  the  problems, 
together  with  preliminary  results  that  demonstrate  the  ef¬ 
fectiveness  of  our  solution.  The  rest  of  the  paper  is  orga¬ 
nized  as  follows:  Section  2  describes  our  application  and 
motivates  the  need  for  circular  distributions  in  the  context 
of  machine  learning;  Section  3  presents  the  von  Mises  dis¬ 
tribution,  which  is  a  circular  version  of  the  normal  distribu¬ 
tion;  Section  4  discusses  the  problems  faced  due  to  heading 
deviations  and  presents  our  solution  to  the  problem;  Sec¬ 
tion  5  presents  experiments  and  results  to  demonstrate  the 
usefulness  of  our  approach;  Section  6  concludes  the  paper. 

2  LEARNING  TOPOLOGICAL  MODELS 

Hidden Markov models (HMMs), as well as their
generalization to models for partially observable Markov
decision processes (POMDP models), are a useful tool for
representing environments such as road networks and office
buildings,  which  are  typical  for  robot  navigation  and  plan¬ 
ning  [1,  14, 18].  Previous  work  on  planning  with  such  mod¬ 
els  typically  assumed  that  the  model  is  manually  provided. 
Manual  acquisition  of  these  models  can  be  very  tedious 
and  hard.  It  is  desirable  to  learn  such  models  automati¬ 
cally,  both  for  robustness  and  in  order  to  cope  with  new  and 
changing  environments.  Since  POMDP  models  are  a  simple 
extension  of  HMMs,  they  can,  theoretically,  be  learned  with 
a  simple  extension  to  the  Baum-Welch  algorithm  [15]  for 
learning  HMMs.  However,  without  a  strong  prior  constraint 
on  the  structure  of  the  model,  the  Baum-Welch  algorithm 
does  not  perform  very  well:  it  is  slow  to  converge,  requires 
a  great  deal  of  data,  and  often  becomes  stuck  in  local  max¬ 
ima.  In  previous  work  [16,  17]  we  demonstrated  how  the 
simple  Baum-Welch  algorithm  can  be  enhanced  with  weak 
local  odometric  information  to  learn  better  models  faster, 
under  the  assumption  listed  above.  For  the  sake  of  com¬ 
pleteness,  we  briefly  review  the  essentials  of  this  work  here. 

A  robot  moves  through  the  corridors  in  an  office  environ¬ 
ment.  Low-level  software  provides  a  level  of  abstraction 
that  allows  the  robot  to  move  through  hallways  from  inter¬ 
section  to  intersection  and  turn  ninety  degrees  to  the  left 
or  right.  At  each  intersection,  ultrasonic  data  interpretation 
lets  the  robot  observe,  in  each  of  the  four  cardinal  direc¬ 
tions,  whether  there  is  an  open  space,  a  door,  a  wall,  or 
something unknown. The robot also has encoders on its wheels that allow
it to estimate its current pose (position and orientation) with respect
to its pose at the previous intersection. Of course, the action and
perception routines and the odometric measures are all subject to error.
The learning task is to deduce a model from the recorded observations
and odometric information.

Our learning algorithm gets as an input an experience sequence E of
observations and odometric readings, and produces as output an HMM¹,
$\lambda$, of the environment, such that the likelihood,
$\Pr(E \mid \lambda)$, is locally maximized. Formally, the standard HMM
is defined as a tuple $\lambda = \langle S, O, A, B, \pi \rangle$, where:

• $S = \{s_1, \ldots, s_N\}$ is a finite set of N states;

• $O = \prod_{i=1}^{l} O_i$ is a finite set of observation vectors of
length $l$; the ith element of an observation vector is chosen from the
finite set $O_i$;

• A is a stochastic transition matrix, with
$A_{i,j} = \Pr(q_{t+1} = s_j \mid q_t = s_i)$; $1 \le i, j \le N$;
$q_t$ is the state at time t;

• B is an array of $l$ stochastic observation matrices, with
$B_{i,j,o} = \Pr(V_t[i] = o \mid q_t = s_j)$; $1 \le i \le l$,
$1 \le j \le N$, $o \in O_i$; $V_t$ is the observation vector at time t;

• $\pi$ is a stochastic initial probability vector describing the
distribution of the initial state.

Odometric  information  gathered  by  the  robot  is  not  an  in¬ 
herent  part  of  the  topological  model,  but  is  used  by  the 
learning  algorithm  to  better  identify  and  distinguish  states. 
To  facilitate  the  use  of  this  information  we  augment  the 
standard  model  with  the  odometric  relation  matrix: 

• R is a relation matrix, specifying for each pair of states, $s_i$ and
$s_j$, the mean and variance of the D-dimensional metric relation
between them; $\mu^{d}_{i,j} = \mu(R_{i,j}[d])$ is the mean of the dth
component of the relation between $s_i$ and $s_j$ and
$(\sigma^{d}_{i,j})^2 = \sigma^2(R_{i,j}[d])$, the variance, where
$1 \le d \le D$. Furthermore, R is geometrically consistent: for each
component d, the relation $R^d(a, b) = \mu(R_{a,b}[d])$ must satisfy the
following properties for all states a, b, and c:

  ◦ $R^d(a, a) = 0$;
  ◦ $R^d(a, b) = -R^d(b, a)$ (anti-symmetry); and
  ◦ $R^d(a, c) = R^d(a, b) + R^d(b, c)$ (additivity).
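
The three consistency properties can be checked mechanically for a single component d. The following minimal Python sketch assumes the mean relations are stored in a nested list `rel[a][b]`; that storage layout, the function name `is_geometrically_consistent` and the tolerance are illustrative assumptions, not part of the learning algorithm described here.

```python
from itertools import product

def is_geometrically_consistent(rel, tol=1e-9):
    """Check R(a,a)=0, anti-symmetry and additivity for one component.

    rel[a][b] holds the mean relation from state a to state b
    (a hypothetical storage layout chosen for this illustration).
    """
    states = range(len(rel))
    zero_diag = all(abs(rel[a][a]) <= tol for a in states)
    anti_sym = all(abs(rel[a][b] + rel[b][a]) <= tol
                   for a, b in product(states, repeat=2))
    additive = all(abs(rel[a][c] - (rel[a][b] + rel[b][c])) <= tol
                   for a, b, c in product(states, repeat=3))
    return zero_diag and anti_sym and additive

# Relations induced by fixed 1-D positions are consistent by construction.
positions = [0.0, 2.5, 4.0]
rel = [[pb - pa for pb in positions] for pa in positions]
consistent = is_geometrically_consistent(rel)

# Perturbing a single entry breaks anti-symmetry and additivity.
rel[0][2] += 1.0
broken = is_geometrically_consistent(rel)
```

The additivity check is cubic in the number of states, which is acceptable for a diagnostic on small models.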

The  odometric  information  recorded  by  the  robot  at  time  t, 
rt,  consists  of  the  change  in  the  x  and  y  coordinates  of  the 
odometric  readings  when  moving  from  state  qt-i  to  state 
qt,  as  well  as  the  change  of  the  robot’s  heading,  9,  between 
these  states. 

An arbitrary initial model $\lambda_0$ is assumed. Then an expectation
maximization algorithm [3] is executed as follows:

• E-step: computes the state-occupation and transition probabilities,
$\gamma_t(i) = \Pr(q_t = s_i \mid E, \lambda)$ and
$\xi_t(i, j) = \Pr(q_t = s_i, q_{t+1} = s_j \mid E, \lambda)$,
respectively, at each time t in the sequence, given E and the current
model $\lambda$, and

• M-step: finds a new model $\lambda$ that maximizes
$\Pr(E \mid \lambda, \gamma, \xi)$.

¹We discuss here HMMs rather than POMDP models. Extension to POMDPs is
straightforward, but notationally more cumbersome.

[Figure 1 labels: state a at (x, y, θ); state b at (x, y, θ + 180°)]

Figure 1: Robot changes heading from state a to state b.

Introducing odometric information requires iterative updates of the
odometric relations between pairs of states, in the relation matrix, R.
The updates need to maintain the properties listed above, although
currently the update procedure only satisfies the first two.

The  learning  task  is  further  complicated  by  the  special  na¬ 
ture  of  the  heading  reading  and  the  rotational  errors  ac¬ 
crued.  The  following  section  goes  in  more  detail  into  the 
special  issues  of  handling  the  heading  information.  The 
rest  of  the  paper  deals  with  resolving  the  problems  caused 
by  rotational  errors. 

3  DIRECTIONAL  DATA  AND 
DISTRIBUTIONS 

Suppose a robot is in state a, which is in location (x, y) facing in
direction $\theta$, as shown in Figure 1. By turning backwards, it
transitions to state b, and a respective change of heading of
approximately ±180° is recorded. Thus the new recorded configuration of
the robot is
$(x \pm \epsilon_1, y \pm \epsilon_2, \theta \pm 180^\circ \pm \epsilon_3)$,
where $\epsilon_i$ is the error due to inaccuracy in both measurement
and movement. In earlier work [17], we treated all errors, in both
location (x, y) and heading ($\theta$), as if they were normally
distributed. However, the change in heading is different from changes in
x and y, since angular measurements are cyclic. That is, a change in
heading of $\theta$° is the same as that of $\theta \pm 360^\circ k$,
for any integer k.

If we knew in advance, for every pair of states, the approximate change
in heading ($\Delta\theta$) between them, we could have modeled it as
normal with mean $\Delta\theta$ and small variance. We could have
adopted a convention, normalizing all angles to be within a cyclic
range, e.g. [-180°, 180°] (similarly we may use radians), and always
chosen to take as the angular change between two points
$\min(|\Delta\theta|, 360^\circ - |\Delta\theta|)$, and assigned it the
correct sign. Such an approach of using a non-circular distribution is
justified when the estimation of a position is based only on readings
known a priori to be taken near this position (see for example work by
Thrun et al. [20] and Lu et al. [12]).


However,  we  do  not  know  in  advance  the  angles  between 
states.  The  data  is  a  sequence  of  measurements  recorded  at 
all  the  states.  We  estimate  the  probabilities  of  the  states  in 
which  they  were  recorded,  and  take  a  weighted  mean  of  the 
measurements  in  order  to  estimate  the  angular  change  be¬ 
tween  every  two  states.  Thus,  we  are  facing  the  following 
problem:  What  is  the  interpretation  of  a  "mean  angle”? 

As an example, suppose we want to estimate the heading change from
state a to state b of Figure 1. We adopt the convention of angles being
expressed between -180° and 180°. Also, suppose that the robot recorded
two measurements of angular distance from state a to state b: -169° and
185°. The simple average between these measurements is an estimate of
the mean heading change of 8°. Obviously this value does not even
approximate the change of heading between the two states. The same
problem arises if we use any other convention for expressing angles
(e.g. 0° to 360°). The problem lies in the fact that angles that are
about 180° away from the mean angle indeed greatly deviate from this
mean, while angles that deviate about 360° are actually very close to
it. To capture this idea, the concept of circular distribution is
required. We provide a brief introduction to the concepts and techniques
used for handling directional data. In particular we concentrate on the
von Mises distribution, a circular version of the normal distribution.
Further discussion can be found in the statistical literature
[6, 10, 13]. Section 3.3 returns to show how the theory is applied in
our model and learning algorithm.

3.1  STATISTICS  OF  DIRECTIONAL  DATA 

Directional data in the 2-dimensional space can be represented as a
collection of 2-dimensional vectors,
$\langle (x_1, y_1), \ldots, (x_n, y_n) \rangle$, on the unit circle, as
shown in Figure 2. The points can also be represented as the
corresponding angles between the radii from the center of the unit
circle and the x axis, $\langle \theta_1, \ldots, \theta_n \rangle$,
respectively. The relationship between the two representations is:

$$x_i = \cos(\theta_i), \quad y_i = \sin(\theta_i), \quad (1 \le i \le n).$$

The vector mean of the n points, $(\bar{x}, \bar{y})$, is calculated as:

$$\bar{x} = \frac{\sum_{i=1}^{n} \cos(\theta_i)}{n}, \quad \bar{y} = \frac{\sum_{i=1}^{n} \sin(\theta_i)}{n} \qquad (1)$$

Using polar coordinates, we can express the mean vector in terms of
angle, $\bar{\theta}$, and length, $a$, where (except for the case
$\bar{x} = \bar{y} = 0$):

$$\bar{\theta} = \arctan\left(\frac{\bar{y}}{\bar{x}}\right), \quad a = (\bar{x}^2 + \bar{y}^2)^{1/2} \qquad (2)$$

The angle $\bar{\theta}$ is the mean angle, while the length $a$ is a
measure (between 0 and 1) of how concentrated the sample angles are
around $\bar{\theta}$. The closer $a$ is to 1, the more concentrated the
sample is around the mean, which corresponds to a smaller sample
variance.

Figure 2: Directional data represented as angles and as vectors on the
unit circle.
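
Equations 1 and 2 can be tried on the two measurements from the example earlier in this section (-169° and 185°). The Python sketch below is illustrative only: the function name `circular_mean` is an assumption, and `atan2` is used as a quadrant-aware form of the arctan in Equation 2.

```python
import math

def circular_mean(angles_deg):
    """Return (mean angle in degrees, resultant length a) of the samples."""
    n = len(angles_deg)
    x_bar = sum(math.cos(math.radians(t)) for t in angles_deg) / n
    y_bar = sum(math.sin(math.radians(t)) for t in angles_deg) / n
    # atan2 resolves the quadrant that a plain arctan(y/x) leaves ambiguous.
    mean_deg = math.degrees(math.atan2(y_bar, x_bar))
    length = math.hypot(x_bar, y_bar)
    return mean_deg, length

samples = [-169.0, 185.0]             # the two measurements from the example
naive = sum(samples) / len(samples)   # arithmetic mean: 8.0
mean_deg, a = circular_mean(samples)  # mean_deg is close to -172 degrees
```

The vector mean lands near -172°, which is equivalent to a heading change of 188°, while the arithmetic mean of 8° points in almost the opposite direction. The resultant length is close to 1 because the two readings are only 6° apart on the circle.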


Figure 3: The von Mises distribution with mode 0 and various $\kappa$
values.

A function f is a density function of a continuous circular distribution
if and only if: $f(x) \ge 0$ and $\int_0^{2\pi} f(x)\,dx = 1$. A simple
example of a circular distribution is the uniform circular distribution,
whose density function is $f(\theta) = \frac{1}{2\pi}$ (where $\theta$
is measured in radians).

One way of deriving a circular version of an unlimited linear
distribution is through “wrapping” it around the circumference of the
unit circle. If x is a random variable on the line with probability
density function f(x), the wrapped random variable
$x_w = x \bmod 2\pi$ is distributed according to a wrapped distribution
with the probability density function:
$f_w(\theta) = \sum_{k=-\infty}^{\infty} f(\theta + 2\pi k)$. Applying
this derivation to the normal distribution results in a circular version
of the normal distribution, but estimating its parameters from sample
data can be hard [6, 13]. An easier-to-estimate circular version of the
normal distribution was derived by von Mises [6, 13]. We use this
distribution to model the robot heading in this work, and it is
described below.
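
As a small illustration of wrapping, the Python sketch below approximates the wrapped normal density by truncating the infinite sum to a finite number of terms. The truncation, the function names and the parameter values are assumptions made for this example; the final lines check numerically that the result integrates to about 1 over one turn.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the linear normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def wrapped_normal_pdf(theta, mu, sigma, terms=10):
    """Approximate f_w(theta) = sum_k f(theta + 2*pi*k) using |k| <= terms."""
    return sum(normal_pdf(theta + 2.0 * math.pi * k, mu, sigma)
               for k in range(-terms, terms + 1))

# Riemann-sum check that the wrapped density integrates to about 1 on [0, 2*pi).
steps = 2000
width = 2.0 * math.pi / steps
total = sum(wrapped_normal_pdf(i * width, mu=0.5, sigma=1.5)
            for i in range(steps)) * width
```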


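3.2 THE VON MISES DISTRIBUTION

A circular random variable, $\theta$, $0 \le \theta \le 2\pi$, is said
to have the von Mises distribution with parameters $\mu$ and $\kappa$,
where $0 \le \mu < 2\pi$ and $\kappa > 0$, if its probability density
function is:

$$f_{\mu,\kappa}(\theta) = \frac{1}{2\pi I_0(\kappa)} e^{\kappa \cos(\theta - \mu)}$$

where $I_0(\kappa)$ is the modified Bessel function of the first kind
and order 0:

$$I_0(\kappa) = \sum_{r=0}^{\infty} \frac{1}{(r!)^2} \left(\frac{\kappa}{2}\right)^{2r}$$

Similar to the linear normal distribution, this is a unimodal
distribution, symmetrical around $\mu$. The mode is at $\theta = \mu$
while the antimode is at $\theta = \mu + \pi$. We observe that the ratio
of the density at the mode to the density at the antimode is
$e^{2\kappa}$, which indicates that the larger $\kappa$ is, the more
concentrated the density is about the mode. Figure 3 shows an
“unwrapped” plot of the von Mises distribution for various values of
$\kappa$ where $\mu = 0$.

We now describe how to estimate the parameters $\mu$ and $\kappa$ given
a set of heading samples (angles $\theta_1, \ldots, \theta_n$) from a
von Mises distribution [13]. We are looking for maximum likelihood
estimates for $\mu$ and $\kappa$. The likelihood function for the data
generated by a von Mises distribution with parameters $\mu$ and $\kappa$
is:

$$L(\mu, \kappa) = \prod_{i=1}^{n} f_{\mu,\kappa}(\theta_i) = \frac{e^{\kappa \sum_{i=1}^{n} \cos(\theta_i - \mu)}}{(2\pi)^n I_0(\kappa)^n}$$

The maximum likelihood estimate for $\mu$, $\hat{\mu}$, is:
$\hat{\mu} = \arctan(\bar{y}/\bar{x})$, where $\bar{y}$, $\bar{x}$ are
as defined in equation 1.

The maximum likelihood estimate for $\kappa$ is the $\hat{\kappa}$ that
solves the equation:

$$\frac{I_1(\hat{\kappa})}{I_0(\hat{\kappa})} = \frac{1}{n} \sum_{i=1}^{n} \cos(\theta_i - \hat{\mu}) \qquad (3)$$

where $I_1(\kappa)$ is the modified Bessel function of the first kind
and order 1.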

If we don't know $\mu$ and are only interested in estimating $\kappa$
with respect to the estimate $\hat{\mu}$, by using trigonometric
manipulation and the definition of $a$ (Equation 2), we can substitute
the right hand side of equation 3 by $a$ and obtain that the maximum
likelihood estimate for $\kappa$ is the $\hat{\kappa}$ that satisfies:
$\frac{I_1(\hat{\kappa})}{I_0(\hat{\kappa})} = a$.

However, if we do have a given $\mu$ and want to find a maximum
likelihood estimate for the concentration $\kappa$ of the sample data
around that specified $\mu$, we need to use as a maximum likelihood
estimate for $\kappa$ the $\hat{\kappa}$ that satisfies:

$$\frac{I_1(\hat{\kappa})}{I_0(\hat{\kappa})} = \frac{1}{n} \sum_{i=1}^{n} \cos(\theta_i - \mu).$$

The above estimation formulae agree with the intuition that the sample
is more concentrated ($\hat{\kappa}$ is larger) about the sample mean
($\hat{\mu}$) than about the true distribution mean ($\mu$).
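
The estimation procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the Bessel functions are evaluated from their power series, the equation for the concentration is solved by bisection (in place of a lookup table), the search is capped at a hypothetical upper bound of 50, and negative targets are clamped to 0 because the quotient is never negative. All function names are illustrative.

```python
import math

def bessel_i(order, k, terms=60):
    """Modified Bessel function of the first kind, I_order(k), by power series."""
    return sum((k / 2.0) ** (2 * r + order)
               / (math.factorial(r) * math.factorial(r + order))
               for r in range(terms))

def bessel_ratio(k):
    """The quotient I_1(k) / I_0(k), which increases from 0 towards 1."""
    return bessel_i(1, k) / bessel_i(0, k)

def solve_kappa(target, lo=0.0, hi=50.0, iters=60):
    """Bisection for I_1(k)/I_0(k) = target on [lo, hi] (hi is an assumed cap)."""
    target = max(target, 0.0)  # assumption: clamp, the quotient is non-negative
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bessel_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fit_von_mises(angles, mu=None):
    """ML estimates (mu, kappa) from angles in radians; mu may be supplied."""
    n = len(angles)
    if mu is None:
        x_bar = sum(math.cos(t) for t in angles) / n
        y_bar = sum(math.sin(t) for t in angles) / n
        mu = math.atan2(y_bar, x_bar)  # quadrant-aware arctan(y_bar / x_bar)
    target = sum(math.cos(t - mu) for t in angles) / n
    return mu, solve_kappa(target)

angles = [0.1, 0.3, -0.2, 0.25, 0.05]
mu_hat, kappa_hat = fit_von_mises(angles)       # mean estimated from the sample
_, kappa_given = fit_von_mises(angles, mu=0.0)  # mean supplied in advance
```

On this sample `kappa_hat` exceeds `kappa_given`, matching the intuition that a sample is more concentrated about its own mean than about any other specified mean.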

The  rest  of  the  section  explains  how  the  von  Mises  param¬ 
eters  are  incorporated  into  the  Hidden  Markov  model,  and 
how  the  learning  algorithm  is  adapted  to  learn  these  param¬ 
eters. 

3.3  HANDLING  ANGULAR  ODOMETRIC 
READINGS 

To model the heading difference between each pair of states, the
relation matrix R, described in Section 2, is 3-dimensional, consisting
of the components (x, y, $\theta$). The component $R_{i,j}[\theta]$
represents the heading change of moving from state $s_i$ to $s_j$, and
is assumed to be distributed according to the von Mises distribution.
The notation $\mu^{\theta}_{i,j} = \mu(R_{i,j}[\theta])$ represents the
mean of the distribution for this heading change, while
$\kappa^{\theta}_{i,j} = \kappa(R_{i,j}[\theta])$ represents the
concentration parameter around the mean². The three constraints
described before for the components of R (ideally) hold for the
$\theta$ component as well.

Similarly, every observed relation item, $r_t$, in the experience
sequence E, has a heading-change component, $\theta$, which records the
robot's estimated change in heading between the state at time t, $q_t$,
and the state $q_{t+1}$.

The reestimation formula for the von Mises mean parameter of the heading
change between states $s_i$ and $s_j$ is:

$$\bar{\mu}^{\theta}_{i,j} = \arctan\left( \frac{\sum_{t=0}^{T-2} \left[ \sin(r_t[\theta])\,\xi_t(i,j) - \sin(r_t[\theta])\,\xi_t(j,i) \right]}{\sum_{t=0}^{T-2} \left[ \cos(r_t[\theta])\,\xi_t(i,j) + \cos(r_t[\theta])\,\xi_t(j,i) \right]} \right)$$

The fraction denotes the ratio between the expected sine and the
expected cosine of the heading change from state i to state j. Since the
heading change from j to i is identical in magnitude but opposite in
direction to the heading change from i to j, the transitions from j to i
are also accumulated, with reversed signs. By taking arctan of this
ratio we get an estimate for the mean heading change itself.
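
The weighted accumulation described above can be sketched in Python for a single pair of states. The list-based layout of the transition probabilities and the function name `reestimate_mean_heading` are assumptions made for this illustration, and `atan2` stands in for the arctan of the ratio so that the quadrant is resolved.

```python
import math

def reestimate_mean_heading(headings, xi_ij, xi_ji):
    """Weighted circular mean of heading changes for the state pair (i, j).

    headings[t] : recorded heading change r_t[theta], in radians
    xi_ij[t]    : probability of the transition i -> j at time t
    xi_ji[t]    : probability of the transition j -> i at time t
    """
    # Transitions j -> i contribute with the sign of the sine reversed.
    num = sum(math.sin(r) * (f - b)
              for r, f, b in zip(headings, xi_ij, xi_ji))
    den = sum(math.cos(r) * (f + b)
              for r, f, b in zip(headings, xi_ij, xi_ji))
    return math.atan2(num, den)  # quadrant-aware arctan(num / den)

# Two i -> j transitions and one j -> i transition (hypothetical data).
headings = [1.50, -1.60, 1.55]
xi_ij = [1.0, 0.0, 1.0]
xi_ji = [0.0, 1.0, 0.0]
mu_ij = reestimate_mean_heading(headings, xi_ij, xi_ji)
```

The reading of -1.60 taken while moving from j to i counts as +1.60 for the direction i to j, so the estimate is the circular mean of 1.50, 1.60 and 1.55.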

To reestimate the concentration parameter, we need to find
$\bar{\kappa}^{\theta}_{i,j}$ such that:


²In contrast, x and y are normally distributed and have their variance
rather than concentration stored in R.


Finding the $\bar{\kappa}^{\theta}_{i,j}$ that satisfies this equation
is done through the use of a lookup table listing values of the quotient
$\frac{I_1(\kappa)}{I_0(\kappa)}$.

The  above  reestimation  formulae  agree  with  the  maximum 
likelihood  estimator  formulae  given  in  Section  3.1.  Their 
correctness  can  be  proved  along  the  lines  of  the  proof  pro¬ 
vided  in  our  previous  document  [16]. 

4  STATE-RELATIVE  COORDINATE 
SYSTEMS 

In  our  previous  work  we  assumed  that  there  is  a  sin¬ 
gle  global  coordinate  system  within  which  the  robot  op¬ 
erates.  Moreover,  we  assumed  that  the  robot  collects  its 
data  within  a  perpendicular  corridor  framework  and  that 
it  takes  advantage  of  this  single  perpendicular  framework 
while  recording  odometric  information.  This  assumption 
may  be  troublesome  in  practice.  The  rest  of  the  paper  dis¬ 
cusses  the  potential  problems,  presents  a  method  for  re¬ 
laxing  the  assumptions  and  addressing  the  problems,  and 
demonstrates  the  effectiveness  of  the  solutions  through  ex¬ 
periments  and  results. 

4.1  MOTIVATION 

We  tend  to  think  about  an  environment  as  consisting  of 
landmarks  fixed  in  a  global  coordinate  system  and  corri¬ 
dors  or  transitions  connecting  these  landmarks.  However, 
this  view  may  be  problematic  when  robots  are  involved. 

Conceptually, a robot has two levels in which it operates: the abstract
level, in which it centers itself through corridors, follows walls and
avoids obstacles, and the physical level in which motors turn the wheels
as the robot moves. In the physical level many inaccuracies can occur:
unaligned wheels or unsynchronized motors can cause sidewards drift, an
obstacle under a wheel can cause the robot to slightly rotate around
itself, or uneven floors may cause the robot to slip in a certain
direction. In addition, the odometric measuring instrumentation may be
inaccurate in and of itself. In the abstract level, corrective actions
are constantly executed to overcome the physical drift and drag. For
example, if the left wheel is misaligned and drags the robot leftwards,
a corrective action of moving to the right is constantly taken in the
higher level to keep the robot centered in the corridor.

Such phenomena greatly affect the odometry recorded by the robot, if it
is interpreted with respect to one global framework. For example,
consider the robot depicted in Figure 4. It drifts to the left
$\theta$° when moving from one state to the next, and corrects for it by
moving $\phi$° to the right to maintain itself centered in the corridor,
moving along the solid arrow. Let us assume that states are located
along the center of the corridor, which is aligned with the y axis of
the global coordinate system. The robot steps back and forth in the
corridor. Whenever it reaches a state, its odometry reading changes by
(x, y, $\theta$) along the (X, Y, heading) dimensions, respectively. As
the robot proceeds, the deviation with respect to the x axis becomes
more and more severe. Thus, after going through several transitions, the
odometric changes recorded between every pair of states, with respect to
a global coordinate system, become larger and larger (especially in the
X dimension).

Figure 4: The robot moves in a corridor along the solid arrow,
correcting for drift in the direction of the dashed arrow.

Similar problems of inconsistent odometric changes recorded between
pairs of states can arise along any of the odometric dimensions. They
are especially severe when such inconsistencies arise with respect to
the heading, since this can lead to confusion between the X and the Y
axes, as well as confusion between forwards and backwards movement (when
the deviation in the heading is around 90° or 180° respectively). An
example of our robot's view of a perfectly perpendicular office
environment, based on its odometric readings within a global coordinate
system, is shown in Figure 5. The data was collected by our robot
Ramona, while moving along the corridors in an area of our department,
depicted in Figure 7.

A solution to such a situation is to model the odometric relations of
moving from state $s_i$ to state $s_j$ using a changing coordinate
system which is relative to state $s_i$, as opposed to a global
coordinate system anchored at the initial state. We formalize this idea
and provide the update rules for the odometric information based on this
approach in the rest of this section. We have implemented our solution,
and demonstrate its effectiveness throughout Section 5.

4.2  LEARNING  ODOMETRIC  RELATIONS  WITH 
CHANGING  COORDINATES 

As before, our experience sequence E consists of T pairs of recorded
odometric relations and observation vectors. The odometric relations are
still recorded with respect to the robot's global coordinate system.
However, when learning the relation matrix from the odometric readings,
we interpret the entry $R_{i,j}$ in the relation matrix R as encoding
the information with respect to a coordinate system whose origin is
anchored at the state $s_i$; the y axis is aligned with the robot's
heading in state $s_i$ and the x axis is perpendicular to it. This is
depicted in Figure 6. The robot is in state $s_i$ facing in the
direction pointed to by the y axis. Its relationship to the state $s_j$
is described in terms of the coordinate system shown in the figure. Its
heading in each state is denoted by the bold arrow.

Figure 5: A path in a perpendicular environment, plotted based on
odometric readings taken by the robot Ramona.

Figure 6: Robot in state $s_i$, facing in the direction of the y axis.

To  support  this  interpretation  of  the  relation  matrix  we  need 
to  revisit  the  formulation  of  the  geometrical-consistency 
constraints  stated  in  Section  2,  as  well  as  the  update  for¬ 
mulae  used  when  learning  the  model. 

The consistency constraints have to reflect the coordinate system with
respect to which the odometry is represented. Since the heading
measurement is independent of any specific coordinate system, only the
constraints over the x and y components of the odometric relation need
to be redefined. We denote by $\mu^{\langle x,y \rangle}(a, b)$ the
vector $\langle \mu(R_{a,b}[x]), \mu(R_{a,b}[y]) \rangle$. Let us define
$T_{ab}$ to be the transformation which maps an $(x_a, y_a)$ pair
represented with respect to the coordinate system of state a, to the
same pair represented with respect to the coordinate system of state b,
$(x_b, y_b)$ (note that $T_{ab} = T_{ba}^{-1}$).

More explicitly, as before, let $\mu^{\theta}(a, b)$ be the mean change
in heading from state a to state b (recall that
$\mu^{\theta}(a, b) = -\mu^{\theta}(b, a)$). The transformation $T_{ab}$
is defined as follows:

$$\begin{pmatrix} x_b \\ y_b \end{pmatrix} = T_{ab} \begin{pmatrix} x_a \\ y_a \end{pmatrix} = \begin{pmatrix} x_a \cos(\mu^{\theta}(a,b)) - y_a \sin(\mu^{\theta}(a,b)) \\ x_a \sin(\mu^{\theta}(a,b)) + y_a \cos(\mu^{\theta}(a,b)) \end{pmatrix}$$
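
The transformation is a plane rotation by the mean heading change, which the short Python sketch below makes concrete. The function name `transform` and the numeric values are hypothetical; the example also checks that applying the transformation with the negated angle undoes it, reflecting the inverse relationship between the two directions.

```python
import math

def transform(point, mu_theta_ab):
    """Apply T_ab: rotate (x_a, y_a) by the mean heading change mu_theta(a, b)."""
    x_a, y_a = point
    c, s = math.cos(mu_theta_ab), math.sin(mu_theta_ab)
    return (x_a * c - y_a * s, x_a * s + y_a * c)

mu_ab = math.radians(90.0)     # hypothetical mean heading change from a to b
p_a = (1.0, 2.0)
p_b = transform(p_a, mu_ab)    # the pair expressed relative to state b
back = transform(p_b, -mu_ab)  # T_ba uses mu_theta(b, a) = -mu_theta(a, b)
```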

We can now redefine the consistency constraints for the x and y
components of the odometric relation:

  ◦ $\mu^{\langle x,y \rangle}(a, a) = \langle 0, 0 \rangle$;

  ◦ $\mu^{\langle x,y \rangle}(a, b) = -T_{ba}\left(\mu^{\langle x,y \rangle}(b, a)\right)$ (antisymmetry);

  ◦ $\mu^{\langle x,y \rangle}(a, c) = \mu^{\langle x,y \rangle}(a, b) + T_{ba}\left(\mu^{\langle x,y \rangle}(b, c)\right)$ (additivity).

Figure 7: Model of a prescribed path through a true hallway environment.

Figure 8: Learned topological model.

The reestimation formulae for all the parameters except for the x and y
components of the relation matrix R remain as before. However, the
reestimation formulae for the x and y parameters are changed to reflect
the relative coordinate systems used. $\mu^{x}_{i,j}$ and
$\mu^{y}_{i,j}$ are reestimated as follows:

t=0 

These  reestimation  rules  are  guaranteed  to  satisfy  the  first 
two  geometrical  constraints,  but  not  the  additivity  con¬ 
straint.  Their  correctness  can  be  proved  along  the  lines  of 
the  correctness  proofs  for  all  other  formulae  [16]. 

5  EXPERIMENTS  AND  RESULTS 

The  goal  of  this  work  is  to  use  odometry  to  improve  the 
learning  of  topological  models,  while  using  fewer  iterations 
and  less  data.  We  tested  our  algorithm  in  a  simple  robot- 
navigation  world.  In  earlier  stages  of  this  work,  a  strong 
assumption  underlay  our  experiments:  the  corridors  in  the 
environment  are  all  perpendicular  to  each  other,  and  the 
agent  was  using  this  perpendicularity  to  reset  its  position 
while  accumulating  the  odometric  readings.  Here  we  have 
updated  the  algorithm  and  dropped  the  assumption.  The  ex¬ 
periments  demonstrate  that  the  use  of  odometry,  even  with 
accumulated  rotational  error  and  without  using  the  perpen¬ 
dicularity  assumption,  is  still  very  beneficial. 

5.1  EXPERIMENTAL  SETTING 


Our experiments use both real robot data and simulated data. We ran our
robot Ramona, a modified RWI B21, along a prescribed* directed path in
our department corridors. Low-level routines let Ramona move forward
through hallways from intersection to intersection and to turn ninety
degrees to the left or right. Ultrasonic data interpretation let her
perceive, in three directions (front, left and right), whether there is
an open space, a door, a wall, or something unknown. Doors and
intersections constitute states. When they are detected by Ramona, it
stops and records its observations, as well as its odometric change
between the previous and the current state. All recorded measures as
well as the actions are, of course, subject to error.

The path Ramona followed consists of 4 connected corridors, which
include 17 states, as shown in Figure 7. Black dots represent the
physical locations of states. Multiple states (depicted as numbers in
the plot) associated with a single location correspond to different
orientations of the robot at that location. The larger black circle, at
the bottom left corner, represents the starting position. The
observations associated with each state are omitted for clarity. A
projection of the odometric readings that Ramona recorded along the x
and y dimensions is shown in Figure 5.

To statistically evaluate our algorithm, we use a simulated office
environment in which the robot follows a prescribed path. It is
represented as an HMM consisting of 44 states, and the associated
transition, observation, and odometric distributions. Figure 9 depicts
this HMM. Arrows represent transitions that have probability 0.2 or
higher. Solid arrows represent the most likely transitions between the
states. We generated 5 data sequences from the model, each of length
800, using Monte Carlo sampling. One of these sequences is depicted in
Figure 10. Again, observations are omitted, and this is a projection of
the odometry readings onto a global 2-dimensional coordinate system. For
each sequence we ran our algorithm 10 times. We also ran the standard
Baum-Welch algorithm, not using odometric information, 10 times on each
sequence. For both algorithms we started each run from a randomly picked
initial model.

*Hence, no decisions are executed by the robot, and the model is an HMM
and not a complete POMDP.


5.2  RESULTS 

We used our algorithm to learn a topological model of the environment
from the data gathered by Ramona. Figure 8 shows the topology of one
typical learned HMM. The bold circle represents the initial state. The
semantics of the arrows is as stated before. It is clear that the
learned topology corresponds well to the topology of the true
environment. The observation distributions learned are omitted from the
figure, but they too correspond well to the walls, doors and openings
encountered along the path, while incorporating the identification error
resulting from noisy sensors.

Figure 9: Model of a prescribed path through the simulated hallway
environment.

Figure 10: A data sequence generated by our simulator.

Traditionally,  in  simulation  experiments,  learned  models 
are  quantitatively  compared  to  the  actual  model  that  gen¬ 
erated  the  data.  Each  of  the  models  induces  a  probabil¬ 
ity  distribution  on  strings  of  observations;  the  asymmetric 
Kullback-Leibler  divergence  [11]  between  the  two  distri¬ 
butions  is  a  measure  of  how  far  the  learned  model  is  from 
the  true  model.  We  report  our  simulation  results  in  terms 
of  a  sampled  version  of  the  KL  divergence,  as  described  by 
Juang  and  Rabiner  [9].  It  is  based  on  generating  sequences 
of  sufficient  length  according  to  the  distribution  induced 
by  the  true  model,  and  comparing  their  likelihoods  accord¬ 
ing  to  the  learned  model  with  the  true  model  likelihoods. 
We  ignore  the  odometry  information  when  applying  the  KL 
measure,  thus  allowing  comparison  between  purely  topo¬ 
logical  models  that  are  learned  with  and  without  odometry. 
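
A sampled divergence of this kind can be sketched in Python for small discrete HMMs. This is an illustrative sketch under stated assumptions, not the evaluation code used for the reported results: models are hypothetical `(pi, A, B)` tuples of nested lists with a single observation component, likelihoods come from a scaled forward pass, and the sequence length, number of sequences and seed are arbitrary choices.

```python
import math
import random

def sample_sequence(pi, A, B, length, rng):
    """Draw one observation sequence from a discrete HMM (pi, A, B)."""
    state = rng.choices(range(len(pi)), weights=pi)[0]
    obs = []
    for _ in range(length):
        obs.append(rng.choices(range(len(B[state])), weights=B[state])[0])
        state = rng.choices(range(len(A[state])), weights=A[state])[0]
    return obs

def log_likelihood(pi, A, B, obs):
    """Scaled forward algorithm: log Pr(obs | model).

    Assumes the model gives every observed symbol nonzero probability.
    """
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[r] * A[r][s] for r in range(n)) * B[s][obs[t]]
                     for s in range(n)]
        scale = sum(alpha)
        log_p += math.log(scale)
        alpha = [v / scale for v in alpha]
    return log_p

def sampled_kl(true_model, learned_model, length=500, runs=5, seed=0):
    """Average per-symbol log-likelihood gap on sequences from true_model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        obs = sample_sequence(*true_model, length, rng)
        total += (log_likelihood(*true_model, obs)
                  - log_likelihood(*learned_model, obs)) / length
    return total / runs

true_model = ([0.5, 0.5],
              [[0.95, 0.05], [0.05, 0.95]],
              [[0.95, 0.05], [0.05, 0.95]])
flat_model = ([0.5, 0.5],
              [[0.5, 0.5], [0.5, 0.5]],
              [[0.5, 0.5], [0.5, 0.5]])
kl_self = sampled_kl(true_model, true_model)   # zero: identical models
kl_flat = sampled_kl(true_model, flat_model)   # positive: the flat model fits worse
```

Because the sequences are drawn from the first argument only, the measure is asymmetric, as noted above.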

Table 1 lists the KL divergence between the true and learned model, as
well as the number of iterations until convergence was reached, for each
of the 5 simulation sequences under the two learning settings, averaged
over 10 runs per sequence.

The table demonstrates that the KL divergence with respect to the true
model for models learned using odometry is about 4-5 times smaller than
for models learned without odometric data. To check the significance of
our results we used the simple two-sample t-test. The models learned
using odometric information have highly statistically significantly
(p > 0.9995) lower average KL divergence than the others.

Table 1: Average results of 2 learning settings with 5 training
sequences.

Seq. #                  1       2       3       4       5
With Odo  KL        1.115   1.100   1.095   1.139   1.129
          Iter #     69.7    81.8    84.3    52.4   112.9
No Odo    KL        5.575   4.499   4.997   4.491   5.791
          Iter #    120.4   107.5   116.2   113.3   120.6

Figure 11: Average KL-divergence as a function of length.

In addition, the number of iterations required for convergence when
learning using odometric information is smaller than required when
ignoring such information. Again, the t-test verifies the significance
(p > 0.995) of this result.
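
For readers who want to reproduce the flavor of the comparison, the Python sketch below computes a two-sample t statistic from the per-sequence average KL values in Table 1. It is only an illustration: the text does not state which variant of the test or which samples were used, so the choice of Welch's form and of the five per-sequence averages is an assumption.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t statistic without assuming equal variances (Welch's form)."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

# Per-sequence average KL divergences copied from Table 1.
kl_with_odo = [1.115, 1.100, 1.095, 1.139, 1.129]
kl_no_odo = [5.575, 4.499, 4.997, 4.491, 5.791]
t_stat = welch_t(kl_with_odo, kl_no_odo)  # strongly negative
```

A strongly negative statistic indicates that the average KL divergence with odometry is well below the average without it, relative to the spread of the two samples.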

To  examine  the  influence  of  the  amount  of  data  on  the  qual¬ 
ity  of  the  learned  models,  we  took  one  of  the  5  sequences 
(Seq.  #1)  and  used  its  prefixes  of  length  100  to  800  (the 
complete  sequence),  in  increments  of  100,  as  individual  se¬ 
quences.  We  ran  the  two  algorithmic  settings  over  each  of 
the  8  prefix  sequences,  5  times  repeatedly.  We  then  used 
the  KL-divergence  as  described  above  to  evaluate  each  of 
the  resulting  models  with  respect  to  the  true  model.  For 
each  prefix  length  we  averaged  the  KL-divergence  over  the 
5  runs.  Table  2  summarizes  the  results  of  this  experiment. 
It  lists  the  mean  KL-divergence  over  the  5  runs  for  each  of 
the  prefixes,  as  well  as  the  standard  deviation  around  this 
mean. The plot in Figure 11 depicts the KL-divergence as
a  function  of  the  sequence  length  for  each  of  the  settings. 
Both  the  table  and  the  plot  demonstrate  that,  in  terms  of  the 
KL-divergence,  our  algorithm,  which  uses  odometric  infor¬ 
mation,  is  robust  in  the  face  of  data  reduction.  In  contrast, 
learning without the use of odometry is much more sensitive
to reduction in the amount of data. Again, we applied
the two-sample t-test, which verified the statistical
significance of these results.

Table 2: Average results with 8 incrementally longer sequences.

Seq. Length          800     700     600     500     400     300     200     100
With Odo  Mean KL    1.136   1.201   1.191   1.241   1.216   1.272   1.771   15.076
          Std. Dev.  0.091   0.083   0.131   0.082   0.036   0.085   0.510   12.884
No Odo    Mean KL    5.790   6.249   8.354   10.390  11.490  14.772  20.044  26.619
          Std. Dev.  0.554   0.937   0.179   0.460   0.422   1.280   0.904   0.460

6  CONCLUSIONS 

Directional  information  which  comes  up  in  various  appli¬ 
cations  of  computer  science  in  general  and  machine  learn¬ 
ing  in  particular,  requires  special  treatment.  Currently  most 
statistical  models  and  applications  are  based  on  distribu¬ 
tions  that  are  either  discrete  or  continuous  along  the  real 
line,  rather  than  circular.  It  is  important  to  be  aware  of  the 
need  for  circular  distributions  as  well  as  of  their  existence. 
Moreover,  it  would  be  useful  to  have  widely  used  applica¬ 
tions  such  as  Autoclass  [2]  support  such  distributions. 

A  problematic  aspect  of  directional  data  which  manifests 
itself  when  learning  maps  and  models  for  robot  navigation 
is  that  of  cumulative  rotational  errors.  In  the  context  of 
our work we have demonstrated that the use of relative
coordinate systems rather than global ones supports learning
relationships between states. The main point shown by this
paper  is  that  through  correct  treatment  of  directional  data, 
odometric  information  which  is  weak  and  very  noisy  still 
provides  a  significant  leverage  when  learning  a  purely  topo¬ 
logical  map. 
Abstract

An  important  and  difficult  prediction  task 
in  many  domains,  particularly  medical  deci¬ 
sion  making,  is  that  of  prognosis.  Progno¬ 
sis  presents  a  unique  set  of  problems  to  a 
learning  system  when  some  of  the  outputs 
are  unknown.  This  paper  presents  a  new  ap¬ 
proach  to  prognostic  prediction,  using  ideas 
from  nonparametric  statistics  to  fully  utilize 
all  of  the  available  information  in  a  neural  ar¬ 
chitecture.  The  technique  is  applied  to  breast 
cancer  prognosis,  resulting  in  flexible,  accu¬ 
rate  models  that  may  play  a  role  in  prevent¬ 
ing  unnecessary  surgeries. 

1  Introduction 

This  paper  applies  artificial  neural  network  classifica¬ 
tion  to  the  analysis  of  survival  or  lifetime  data  (Lee, 
1992),  in  which  the  objective  can  be  broadly  defined 
as  predicting  the  future  time  of  a  particular  event.  In 
this  work  we  are  concerned  specifically  with  prognosis, 
that  is,  predicting  the  course  of  a  disease.  These  meth¬ 
ods  are  applied  to  breast  cancer  prognosis,  predict¬ 
ing  how  long  after  surgery  we  can  expect  the  disease 
to recur. This problem has significant clinical importance.
Decisions regarding chemotherapy and its intensity
are based on the anticipated course of the cancer. For
example,  patients  with  favorable  outlooks  may  forego 
chemotherapy  entirely.  Those  with  less  favorable  out¬ 
looks  may  undergo  varying  intensities  of  chemother¬ 
apy,  or  even  bone  marrow  transplantation. 

Prognostic prediction does not fit comfortably into either
of the classic learning paradigms of function approximation
or classification. While a patient can be
classified “recur” if the disease is observed, there is
no real cutoff point at which the patient can be considered
a non-recurrent case. The data are therefore
censored  in  that  we  know  a  time  to  recur  for  only 
a  subset  of  patients.  For  the  others,  we  know  only 
the  time  of  their  last  check-up,  or  disease-free  survival 
time  (DFS).  In  particular,  recurrence  or  survival  data 
is  right  censored,  i.e.,  the  right  endpoint  (recurrence 
time)  is  sometimes  unknown,  since  some  patients  will 
inevitably  move  away,  change  doctors,  or  die  of  un¬ 
related  causes.  Therefore,  in  many  cases,  the  train¬ 
ing  signal  for  the  learning  method  is  not  well-defined. 
Prognosis  is  not  viewed  here  as  a  time-series  predic¬ 
tion  problem,  since  the  predictive  features  are  gathered 
only  once,  at  the  time  of  diagnosis  and/or  surgery. 

Problems  involving  censored  data  are  common  to  sev¬ 
eral  fields.  In  engineering,  one  might  be  interested 
in  the  survival  characteristics  of  electronic  compo¬ 
nents,  while  sociologists  might  consider  what  factors 
lead  to  long-lasting  marriages.  These  problems  have 
traditionally  been  approached  using  statistical  tech¬ 
niques  such  as  Cox  proportional-hazards  regression 
(Cox,  1972).  In  recent  years,  there  has  been  an  in¬ 
creased  interest  in  the  application  of  machine  learn¬ 
ing  methods  to  prediction  using  censored  data.  Sev¬ 
eral  groups  have  approached  prognosis  as  a  separation 
problem  using  different  learning  architectures,  includ¬ 
ing  backpropagation  artificial  neural  networks  (ANNs) 
(Burke, 1994; Burke et al., 1997), entropy maximization
networks (Choong et al., 1996) and decision trees
(Wolberg et al., 1992; Wolberg et al., 1994). This is
done  by  choosing  one  or  more  endpoints  and  learning 
a  yes/no  classifier  on  concepts  such  as  “patients  who 
recurred  in  less  than  two  years.”  Cases  with  follow¬ 
up  time  less  than  the  cutoff  are  discarded  from  the 
training  set.  Ravdin  and  colleagues  (De  Laurentiis  and 
Ravdin,  1994;  Ravdin  and  Clark,  1992)  use  ANNs  to 
generate  survival  curves,  which  plot  the  probability  of 
disease-free  survival  against  time.  This  work  uses  time 
as an input variable and interprets the trained network’s
single output as an approximation of recurrence
probability. This formulation introduces biases
in the training data that must be corrected by repeating
or removing some of the examples. Their computational
results are verified only by demonstrating
that  their  predicted  survival  rates  closely  approximate 
those  of  the  test  cases.  The  problem  has  also  been 
approached  in  an  unsupervised  learning  fashion,  using 
clustering (Bradley et al., 1997) and self-organizing
neural networks (Schenone et al., 1993). However,
these  techniques  did  not  directly  address  the  problem 
of  prediction  using  censored  data. 

While  this  research  also  separates  the  cases  into  classes 
based  on  recurrence  time,  it  differs  from  the  above 
techniques  in  several  respects.  Censored  cases  are  in¬ 
corporated  directly  into  the  training  set,  not  by  us¬ 
ing  an  artificial  cutoff  time,  but  rather  by  using  the 
probability  that  they  will  recur  before  a  certain  time 
as  the  training  signal.  In  this  way  we  use  all  of  the 
information available in the training set. Further,
interpreting the outputs as probabilities lets us not only
separate the cases into “good” and “bad” prognoses,
but also generate predicted survival curves for
individual patients, making the system more useful in a
clinical  setting. 

2  Neural  Architecture 

The  ANNs  used  in  this  work  were  standard  feedfor¬ 
ward  networks  with  one  hidden  layer,  trained  with 
backpropagation  (Rumelhart  et  al.,  1986).  The  hy¬ 
perbolic  tangent  activation  function  was  used  for  hid¬ 
den  and  output  nodes.  The  output  layer  consisted  of 
ten  units;  the  first  represented  the  class  of  examples 
with  recurrences  at  one  year  or  less  following  surgery, 
the  second  those  with  recurrences  between  one  and 
two years, etc., up to ten years.¹ This approach implies
the existence of an extra (in our case, eleventh)
class. These are the patients with expected disease-free
survival time greater than the length of the
study (10 years). The activations of the output units
were trained with and interpreted as the probability
that the patient would have disease-free survival up to
that time. These probabilities were scaled to the range
of the hyperbolic tangent function, i.e., activation = 2
* probability - 1.

¹ The available prognostic studies are approximately ten
years in duration.

In order to maintain the interpretation of the outputs
as probabilities, the relative entropy error function
(Baum and Wilczek, 1988; Solla et al., 1988) was
used for all non-input units. For a given example i,
this error function is defined as

\[
E_i = \sum_k \left[ \frac{1}{2}\,(1 + T_k^i) \log \frac{1 + T_k^i}{1 + O_k^i}
      + \frac{1}{2}\,(1 - T_k^i) \log \frac{1 - T_k^i}{1 - O_k^i} \right]
\]

where $T_k^i$ is the target value for output unit k and $O_k^i$
is its output value. Outputs of +1 and -1 correspond
to definitely true and definitely false, respectively, with
intermediate values again being scaled into the appropriate
range.
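
As a rough illustration of the scaling and error function described in this section, the sketch below maps probabilities to the tanh range with activation = 2 * probability - 1 and evaluates a relative entropy error between target and output activations. It assumes the standard relative entropy form for units with outputs in [-1, 1]; the function names are hypothetical and this is not the author's code.

```python
import math


def to_activation(probability):
    """Scale a probability in [0, 1] to the tanh range [-1, 1]."""
    return 2.0 * probability - 1.0


def to_probability(activation):
    """Inverse of to_activation."""
    return (activation + 1.0) / 2.0


def _term(p, q):
    """p * log(p / q), using the convention that 0 * log(0 / q) = 0."""
    return 0.0 if p == 0.0 else p * math.log(p / q)


def relative_entropy_error(targets, outputs):
    """Relative entropy between target and output activations, summed over units.

    Outputs are assumed to lie strictly inside (-1, 1), as tanh outputs do;
    an output of exactly +1 or -1 against a disagreeing target would divide by zero.
    """
    total = 0.0
    for t, o in zip(targets, outputs):
        p, q = to_probability(t), to_probability(o)
        total += _term(p, q) + _term(1.0 - p, 1.0 - q)
    return total


# Targets of +1 / -1 with outputs that are close but not exact.
example_error = relative_entropy_error([1.0, -1.0], [0.9, -0.9])
print(example_error)
```

The error is zero when every output equals its target and grows as outputs move away from the targets, which is what allows intermediate targets (probabilities for censored cases) to be trained in the same way as the definite +1 and -1 targets.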

For recurrent cases, the network was trained with values
of +1 for all outputs up to the observed recurrence
time, and -1 thereafter. For instance, a recurrence at
32 months would have a training vector T = {1, 1, -1,
-1, -1, -1, -1, -1, -1, -1}. The value of the probability
formulation is seen in the censored cases. They were
similarly trained with values of +1 up to the observed
disease-free survival time. The probabilities of DFS
for later times were computed using a variation of the
standard Kaplan-Meier maximum likelihood approximation
to the true population survival rate (Kaplan
and Meier, 1958). We define the risk of recurrence at
time t > 0 as the conditional probability that a patient
will recur at time t, given that they have not recurred
up to time t-1. As an example, consider a study
containing a total of 20 patients. If two recurrences
were observed in the first time interval, we would have
$\mathrm{risk}_1 = 0.1$. Further suppose that the study has two
censored cases in the first time interval, and two more
recurrences in the second interval. There are 16 patients
at risk for recurrence during interval two, with
two recurrences, so $\mathrm{risk}_2 = 0.125$. The Kaplan-Meier
estimator of the disease-free survival curve, S, tracks
the cumulative probability of DFS for any time in the
study, using the risks in the following fashion:

\[
S_t = \begin{cases}
  1, & t = 0 \\
  S_{t-1}\,(1 - \mathrm{risk}_t), & t > 0.
\end{cases}
\]

Continuing the above example, $S_0 = 1.0$, $S_1 = 0.9$,
and $S_2 = 0.7875$. To compute appropriate training
probabilities, we simply use the DFS time of the censored
case as the starting time, rather than time 0:

\[
S_t = \begin{cases}
  1, & 0 \le t \le \mathrm{DFS}(i) \\
  S_{t-1}\,(1 - \mathrm{risk}_t), & t > \mathrm{DFS}(i).
\end{cases}
\]


For  an  individual  output  node  k,  this  training  signal 
represents  the  example’s  probability  of  membership  in 
the class being recognized by that node, i.e., the set of
cases  that  recur  before  the  end  of  year  k.  Collectively, 
the  activation  values  of  the  output  units  represent  an 
expected  survival  curve  for  the  individual  case. 
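
The construction of the training signal can be sketched as follows, reproducing the 20-patient worked example from the text. This is a minimal illustration under stated assumptions: the function names are hypothetical, a recurrence that falls exactly on a year boundary is assigned to the earlier year, and the censored-case recursion simply restarts the Kaplan-Meier product at the case's DFS time.

```python
def interval_risks(n_patients, recurrences, censored):
    """Risk of recurrence per interval: recurrences / patients still at risk."""
    risks, at_risk = [], n_patients
    for n_recur, n_cens in zip(recurrences, censored):
        risks.append(n_recur / at_risk)
        # Recurred and censored patients leave the risk set for later intervals.
        at_risk -= n_recur + n_cens
    return risks


def survival_curve(risks):
    """Kaplan-Meier curve: S_0 = 1 and S_t = S_{t-1} * (1 - risk_t)."""
    curve = [1.0]
    for risk in risks:
        curve.append(curve[-1] * (1.0 - risk))
    return curve


def recurrent_targets(recurrence_month, n_outputs=10):
    """+1 for each full disease-free year before the recurrence, -1 afterwards."""
    disease_free_years = (recurrence_month - 1) // 12
    return [1.0 if k < disease_free_years else -1.0 for k in range(n_outputs)]


def censored_targets(dfs_years, risks):
    """Targets for a censored case; risks[k] is the risk for year k + 1.

    The probability stays at 1 up to the observed DFS time and is then
    multiplied by (1 - risk_t) for each later year, scaled to [-1, 1].
    """
    probability, targets = 1.0, []
    for year, risk in enumerate(risks, start=1):
        if year > dfs_years:
            probability *= 1.0 - risk
        targets.append(2.0 * probability - 1.0)
    return targets


# Worked example: 20 patients, 2 recurrences and 2 censored cases in
# interval one, then 2 more recurrences in interval two.
risks = interval_risks(20, recurrences=[2, 2], censored=[2, 0])
curve = survival_curve(risks)
print(risks, curve)
print(recurrent_targets(32))
print(censored_targets(1, risks))
```

A case censored after one year in this toy study receives a target of +1 for year one and 2 * 0.875 - 1 = 0.75 for year two, so the censored example still contributes a graded training signal for every output unit.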

If  we  view  the  network  as  learning  a  survival  curve,  the 
task  becomes  one  of  function  approximation  using  in¬ 
complete  data.  The  training  signal  is  then  a  modified 
thermometer  encoding  (McCullagh  and  Nelder,  1989), 
a  relatively  common  encoding  for  ordered  categorical 
outputs,  with  the  added  complication  of  the  survival 
probabilities  for  censored  cases.  Since  the  effects  of 
some  of  the  input  features  are  thought  to  be  nonlinear 
over  time,  it  is  also  instructive  to  view  the  problem 
as  a  sequence  of  highly  related  but  distinct  classifi¬ 
cation  problems,  all  learned  using  the  same  internal 
representation  (i.e.,  hidden  nodes).  The  representa¬ 
tion  generated  in  learning  one  group  (say,  those  cases 
that  are  likely  to  recur  before  one  year)  contributes  to 
the  learning  of  other  groups  (say,  those  cases  recurring 
between  5  and  6  years).  This  is  a  form  of  functional 
knowledge  transfer,  similar  to  the  MTL  network  (Bax¬ 
ter,  1995;  Caruana,  1995).  The  learning  of  multiple 
classes  in  parallel  contributes  to  faster  learning  and 
more  reliable  predictive  models. 

The  above  architecture  facilitates  three  different  uses 
of  the  resulting  predictive  model: 

1.  The  output  units  can  be  divided  into  groups  a 
posteriori  to  separate  good  from  poor  prognoses. 
For  a  particular  application,  any  prediction  of  re¬ 
currence  at  a  time  greater  than  five  years  might  be 
considered  favorable,  and  indicate  less  aggressive 
treatment.  The  actual  outcomes  of  those  patients 
in  the  good  group  should  be  significantly  better 
than  those  in  the  poor  group. 

2.  An  individualized  disease-free  survival  curve  can 
easily  be  generated  for  a  particular  patient  by 
plotting  the  probabilities  predicted  by  the  vari¬ 
ous  output  units.  In  order  for  this  curve  to  be 
reliable,  the  activations  should  be  monotonically 
decreasing,  or  very  nearly  so. 

3.  The  expected  time  of  recurrence  can  be  obtained 
merely  by  noting  the  first  output  unit  that  pre¬ 
dicts  a  probability  of  disease-free  survival  of  less 
than  0.5.  This  provides  a  convenient  method 
of  rank-ordering  the  cases  according  to  expected 
outcome. 
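
The three uses listed above can be sketched in a few lines, assuming the ten output activations are mapped back to probabilities with probability = (activation + 1) / 2. The function names, the example activations, and the five-year cutoff are illustrative choices, not values taken from the experiments.

```python
def dfs_probabilities(activations):
    """Map tanh-range activations back to disease-free survival probabilities."""
    return [(a + 1.0) / 2.0 for a in activations]


def expected_recurrence_year(activations):
    """First year whose predicted DFS probability is below 0.5.

    Returns len(activations) + 1 (the implicit extra class) when no
    output unit falls below 0.5.
    """
    for year, p in enumerate(dfs_probabilities(activations), start=1):
        if p < 0.5:
            return year
    return len(activations) + 1


def is_monotone_nonincreasing(activations, tolerance=0.0):
    """Check that the predicted survival curve never rises by more than tolerance."""
    return all(b <= a + tolerance for a, b in zip(activations, activations[1:]))


def prognosis_group(activations, cutoff_year=5):
    """Label a case 'good' if recurrence is predicted after cutoff_year."""
    return "good" if expected_recurrence_year(activations) > cutoff_year else "poor"


# Hypothetical network output for one patient.
outputs = [0.9, 0.7, 0.4, -0.1, -0.5, -0.6, -0.7, -0.8, -0.9, -0.95]
print(dfs_probabilities(outputs))
print(expected_recurrence_year(outputs), prognosis_group(outputs))
print(is_monotone_nonincreasing(outputs))
```

Because all three quantities are read off the same output vector, changing the grouping (for example, a different cutoff year) requires no retraining.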

A  significant  methodological  issue  is  that  of  evaluating 
the learned model. As discussed earlier, this is neither
a function approximation nor a classification problem,
since in many cases we do not know the correct answer.
Still, there is a well-defined goal: the accurate
prediction  of  individual  prognosis.  While  our  training 
method  seeks  to  minimize  the  relative  entropy  error  at 
each  output  unit,  the  reporting  of  this  error  on  testing 
sets  would  be  relatively  uninformative.  We  therefore 
evaluate  the  models  on  two  criteria:  the  accuracy  of 
the  predicted  recurrence  rates  (see  Section  3.4)  and  the 
ability  of  the  models  to  separate  cases  with  favorable 
and  unfavorable  prognoses  (see  Section  3.3). 

3  Experimental  Results 

Computational  experiments  were  performed  on  two 
very  different  breast  cancer  data  sets.  The  first 
is  known  as  Wisconsin  Prognostic  Breast  Cancer 
(WPBC)  and  is  characterized  by  a  small  number  of 
cases,  relatively  high  dimensionality,  very  precise  val¬ 
ues  and  almost  no  missing  data.  The  second  data  set  is 
from  the  Surveillance,  Epidemiology,  and  End  Results 
(SEER)  program  of  the  National  Cancer  Institute.  It 
contains  a  large  number  of  cases,  with  relatively  few, 
coarsely-measured  features,  and  a  high  percentage  of 
missing  values.  Details  on  these  data  sets  are  given 
below. 

In both cases, the prognosis data used in this study
consists of those malignant cases for which follow-up
data was available, after eliminating those cases
with  distant  metastasis  (cancer  has  already  spread; 
prognosis  is  poor)  and  carcinoma  in  situ  (cancer  has 
not  yet  invaded  breast  tissue;  prognosis  is  good).  We 
therefore  maximize  the  clinical  relevance  of  the  study 
by  focusing  on  those  cases  that  present  the  most  diffi¬ 
cult  prognosis. 

Experiments  reported  in  this  section  are  test  set  results 
using  either  tenfold  cross-validation  (WPBC  data)  or 
a  single  randomized  holdout  test  (SEER  data).  The 
ANNs  used  had  three  hidden  units,  and  training  was 
terminated  after  1000  on-line  epochs. 

3.1  Wisconsin  Prognostic  Breast  Cancer 
Data 

In  previous  work  (Mangasarian  et  al.,  1995;  Wolberg 
et  al.,  1994)  the  author  contributed  to  the  development 
of  an  image-processing  software  package  for  breast  can¬ 
cer  diagnosis,  known  as  Xcyt,  which  analyzes  digital 
images  of  cells  taken  from  breast  lumps.  This  program 
computes 10 different features of each cell nucleus
in the image: radius, perimeter, area, compactness,
smoothness, size and number of concavities, symmetry,
fractal dimension, and texture. The mean, standard
error, and extreme values of each feature are computed
for  each  image.  The  current  application  uses  the  30 
nuclear  features  computed  by  Xcyt  together  with  two 
traditional  prognostic  predictors:  tumor  size  and  num¬ 
ber  of  involved  lymph  nodes.  This  data  set  contains 
227  cases,  61  of  which  have  recurred.  An  earlier  ver¬ 
sion  of  this  data  set  is  available  at  the  UCI  machine 
learning  repository  (Merz  and  Murphy,  1996). 

3.2  SEER  Data 

The  SEER  (Carter  et  al.,  1989)  data  set  consists  of 
data  on  cancer  survival  (rather  than  recurrence)  for 
over  38,000  women  newly  diagnosed  with  breast  can¬ 
cer  between  1977  and  1982.  Each  case  contains  the 
following  information:  histological  grade  (four  discrete 
values),  tumor  size,  tumor  extent  (5  discrete  values), 
number  of  positive  lymph  nodes,  and  number  of  nodes 
examined.  Many  of  these  feature  values  are  missing. 
For  instance,  only  about  20%  of  the  cases  contain  a 
value  for  histological  grade;  over  1200  of  the  cases  con¬ 
tain  no  feature  information  at  all.  Each  of  the  SEER 
features  was  encoded  as  a  sequence  of  binary  variables, 
with  an  additional  binary  variable  representing  a  miss¬ 
ing  value. 
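
One way to realize the encoding described above is a one-of-n code with a trailing missing-value bit. The text does not say whether the binary variables form a one-of-n or a thermometer code, so the scheme below is an assumption, and `encode_feature` is a hypothetical helper.

```python
def encode_feature(value, levels):
    """Encode a discrete feature as one bit per level plus a missing-value bit.

    value is one of the entries in levels, or None when the feature is missing.
    """
    bits = [1 if value == level else 0 for level in levels]
    bits.append(1 if value is None else 0)
    return bits


# Histological grade takes four discrete values in the SEER data.
grade_levels = [1, 2, 3, 4]
print(encode_feature(2, grade_levels))
print(encode_feature(None, grade_levels))
```

With this layout a missing value activates only its own indicator, so the network can learn a separate contribution for "unknown" instead of having an arbitrary level imputed.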

3.3  Good  vs.  poor  prognoses 

To  be  used  as  a  clinical  tool,  the  predictive  model 
should  reliably  separate  cases  with  a  good  prognosis 
from  those  with  a  poor  prognosis.  Since  treatment  op¬ 
tions  are  limited,  this  sort  of  stratification  could  be 
most  helpful  to  the  physician  and  the  patient  in  de¬ 
termining  a  post-operative  treatment  plan.  Figure  1 
stratifies  the  WPBC  test  cases  into  those  predicted 
to  recur  in  the  first  five  years  and  those  predicted  to 
recur  at  some  time  greater  than  five  years  (including 
the  implicit  11th  class).  The  difference  in  these  two 
groups  is  statistically  significant  (p  <  0.001,  general¬ 
ized  Wilcoxon  test).  Of  course,  the  output  units  could 
be  grouped  differently  to  define  the  relevant  prognos¬ 
tic  categories  for  a  particular  problem.  Further,  the 
implicit  final  group  could  also  be  subdivided  based  on 
the  activation  level  of  the  last  node. 

Similarly,  Figure  2  shows  survival  probabilities  for 
those  cases  with  good  and  poor  prognosis,  in  this  case, 
predicted  survival  less  than  or  equal  to  ten  years  and 
predicted  survival  greater  than  ten  years.  Again  the 
difference  in  the  two  groups  is  statistically  significant 
(p < 0.001). The difference in dividing points between
the two tests is due to the difference between the measured
endpoints (recurrence in the WPBC data, death
in the SEER data). The ratios of good to bad prognoses
were held nearly constant.

Figure 1: WPBC Data: Disease-free survival probabilities
for those cases predicted to recur in the first five
years (Poor, 58 cases) compared to those predicted to
recur at some time greater than five years (Good, 169
cases).

Figure 2: SEER Data: Survival probabilities for those
cases predicted to die from breast cancer in the first ten
years (Poor, 8,353 cases) compared to those predicted
to die at some time greater than ten years (Good,
26,192 cases).
In traditional breast cancer staging, postoperative
treatment  decisions  are  based  largely  or  even  entirely 
on  whether  or  not  the  cancer  has  spread  to  the  pa¬ 
tient’s  axillary  lymph  nodes.  However,  removing  the 
nodes for examination leaves the arm subject to infection
and possible lymphedema (Aitken et al., 1989),
and does not affect overall survival (Abe et al., 1995).
In  both  of  our  test  cases,  the  separation  with  this 
method  represents  an  improvement  over  that  achieved 
by  the  lymph  node  status  feature.  Further,  statisti¬ 
cally  significant  separation  was  achieved  in  both  data 
sets  without  using  the  lymph  feature  (WPBC,  p 
0.02;  SEER,  p  <  0.001).  This  is  further  confirmation 
of  a  previous  finding  (using  other  analytic  techniques) 
that breast cancer prognosis can be achieved without
lymph node dissection (Wolberg et al., 1997; Wolberg
et al., 1998).

3.4  Predicted  vs.  actual  group  survival 

Another  criterion  for  the  validity  of  the  learned  model 
is  whether  the  predicted  recurrence  rate  follows  that 
of  the  actual  data.  Figure  3  shows  the  Kaplan-Meier 
estimate  of  disease-free  survival  curve  for  the  entire 
WPBC  training  set,  compared  with  the  predicted  DFS 
rates  accumulated  from  the  test  folds.  Again,  a  test 
case  is  predicted  to  recur  at  time  t  if  the  activation  of 
output  node  t  is  the  first  one  indicating  a  DFS  prob¬ 
ability  of  less  than  0.5.  The  two  curves  are  very  simi¬ 
lar  and  show  no  significant  statistical  difference  (p  = 
0.2818,  generalized  Wilcoxon  test  (Gehan,  1965)). 


Figure  3:  WPBC  Data:  Kaplan-Meier  estimate  of  true 
disease-free  survival  curve  compared  to  predicted  DFS 
curve. 


The  predicted  group  survival  for  the  SEER  data  did 
not  closely  match  the  actual  survival  curve.  This  is 
consistent  with  previous  research  (Street  et  al.,  1996) 
using  a  variation  of  the  RSA  prognostic  technique 
(Street  et  al.,  1995)  which  also  showed  that  the  SEER 
data  was  unable  to  replicate  group  survival  character¬ 
istics.  This  is  attributable  to  the  coarse  encoding  of 
the  SEER  variables  and  the  large  percentage  of  miss¬ 
ing  values. 

3.5  Individual  prognostic  prediction 

As  mentioned,  the  activations  of  the  output  units  can 
be  combined  to  form  a  predicted  DFS  curve  for  an  in¬ 
dividual  patient.  Figure  4  shows  an  example  of  this 
usage,  in  a  format  appropriate  for  a  clinical  setting. 
Here  the  probabilities  of  disease-free  survival  for  a  case 
from  the  WPBC  study  are  compared  to  the  cumulative 
values  of  all  patients  in  the  study.  The  output  activa¬ 
tions  were  monotonically  non-increasing,  as  was  the 
case  in  74%  of  the  WPBC  test  examples  and  87%  of 
the  SEER  examples.  The  others  had  occasional  small 
increases, with the maximum increase in any example
corresponding to a probability change of 0.024
(WPBC) and 1.0 (SEER). The expected time of recurrence
can be computed by noting where the DFS
curve  crosses  a  probability  of  50%,  in  this  case,  be¬ 
tween  three  and  four  years.  In  fact,  this  patient  did 
experience  disease  recurrence  in  the  44th  month  fol¬ 
lowing  surgery. 


Figure  4:  Predicted  DFS  curve  of  a  single  case  (87-112) 
from  the  WPBC  study  compared  to  the  overall  group 
DFS  curve  of  the  training  set. 
4 Conclusions

This  paper  develops  a  novel  encoding  of  censored  data 
in  an  artificial  neural  network  architecture  to  provide 
a  framework  for  prognostic  prediction.  In  applying 
the  method  to  breast  cancer  prognosis,  the  resulting 
models  are  shown  to  be  at  least  as  accurate  as  current 
methods,  while  providing  significantly  more  precision 
and  flexibility.  Among  the  future  directions  for  this 
research  is  a  sensitivity  analysis  to  investigate  the  im¬ 
portance  of  the  prognostic  features  at  different  follow¬ 
up  times.  To  evaluate  the  role  of  knowledge  transfer, 
predictive  accuracy  will  be  compared  to  classification 
models  that  predict  recurrence  at  a  chosen  cut  point. 
Most  importantly  from  a  clinical  perspective,  our  work 
in  the  breast  cancer  domain  continues  to  focus  on  gen¬ 
erating  accurate  prognostic  models  without  knowledge 
of  lymph  node  status,  in  order  to  spare  new  patients 
an  extra  and  potentially  debilitating  surgery. 
Abstract

A  common  task  required  of  a  dancer  or  ath¬ 
lete  is  to  move  from  one  prescribed  body  pos¬ 
ture  to  another  in  a  manner  that  is  consis¬ 
tent  with  a  specific  style.  One  can  automate 
this  task,  for  the  purpose  of  computer  ani¬ 
mations,  using  simple  machine-learning  and 
search techniques. In particular, we find kinesiologically
and stylistically consistent interpolation
sequences between pairs of body
postures using graph-theoretic methods to
learn the “grammar” of joint movements in
a given corpus and then applying memory-bounded
A* search to the resulting transition
graphs, using an influence diagram that
captures the topology of the human body in
order to reduce the search space.


1  INTRODUCTION 

A  common  task  required  of  a  dancer  or  athlete  is  to 
move  from  one  prescribed  body  posture  to  another 
in  a  manner  that  is  consistent  with  a  specific  style. 
If  these  postures  are  “far  apart,”  as  measured  by 
some  metric  that  takes  into  account  both  the  kine¬ 
siology  of  the  body  and  the  style  of  the  movement 
genre,  this  can  be  nontrivial.  For  the  purposes  of 
computer-generated  animation,  there  are  a  variety  of 
ways  to  generate  movement  sequences  that  accom¬ 
plish  this  kind  of  task.  One  can,  for  instance,  use 
mathematical  interpolation  techniques  like  splines  to 
move  individual  body  parts  from  one  position  to  an¬ 
other,  but  these  kinds  of  methods  do  not  address  the 
problem  of  kinesiological  illegality  (e.g.,  that  the  knee 
only  bends  180  degrees,  or  that  arms  cannot  pass 
through  ribcages).  Many  animation  packages,  such  as 
Life Forms (http://fas.sfu.ca/lifeforms.html),
use an augmented spline approach that relies on a
table of kinematic constraints to avoid illegal movements,
but this type of approach is somewhat ad hoc.
A more-general way is to use the physics of the body:
derive the associated differential equations (a torque
balance for each joint, say) and solve the equivalent
boundary-value problem. Approaches like this (Hodgins
et al 1995) are extremely interesting and highly
promising, but also very difficult; deducing the control
equations that humans use to recover their balance after
a jump, for example, is a Ph.D. thesis-level problem
(Wooten 1998). Stylistically faithful interpolations
would be even harder to implement; neither splines nor
F = ma can easily capture or enforce, for instance,
the requirement that classical ballet emphasizes position
over motion¹, and developing a mathematics-
or physics-based approach that does so would be all
but impossible. In this paper, we propose an alternative
solution to this problem: a class of corpus-based
interpolation schemes that generate a kinesiologically
and stylistically consistent movement sequence
between two specified body positions by learning and
then enforcing the dynamics of a particular movement
genre.

¹ In ballet, body parts tend to describe piecewise-linear
paths through space, emphasizing the positions at the junctions
of those linear segments; in modern dance, on the
other hand, the motion between the endpoints is the important
feature.

The primary motivation for the development of these
methods was our work on a mathematical technique
(Bradley & Stuart 1997; 1998) that automatically
creates variations on predefined motion sequences,
an idea that was inspired by a similar
scheme (Dabby 1996; 1997) that uses a related procedure
to generate musical variations. We use the mathematics
of nonlinear dynamics to shuffle a predefined
movement sequence by “wrapping” a progression of
special symbols representing the body positions in a
dance piece, martial arts form, or other motion sequence
around a chaotic attractor. This establishes
a symbolic dynamics that links the movement progression
and the attractor geometry, as shown in figure 1.
By definition, trajectories from different starting
points² travel along the same attractor but in a
different  order.  This  property  lets  us  use  the  mapping 
depicted  in  figure  1(d)  to  create  a  variation:  we  sim¬ 
ply  follow  a  new  trajectory  around  the  attractor  and 
invert  the  symbolic  mapping,  “playing”  the  body  po¬ 
sition  for  each  cell  the  trajectory  enters.  Variations 
generated  in  this  manner,  whether  musical  or  choreo¬ 
graphic,  are  both  aesthetically  pleasing  and  strikingly 
reminiscent  of  the  original  sequences.  The  stretching 
and  folding  of  the  chaotic  dynamics  guarantee  that  the 
ordering  of  the  pitches  or  movements  in  the  variation 
is  different  from  the  original  sequence;  at  the  same 
time,  the  fixed  geometry  of  the  attractor  ensures  that 
a  chaotic  variation  of  Bach’s  Prelude  in  C  Major  or 
of  a  short  Balanchine  ballet  sequence  are  related  to 
the  original  piece  in  a  sense  reminiscent  of  the  classic 
“variation  on  a  theme.”  Broadly  speaking,  the  chaotic 
variations  resemble  the  originals  with  some  shuffling  of 
coherent  subsequences.  This  is  the  primary  source  of 
the  stylistic  originality  of  the  chaotic  variation  scheme 
—  in  fact,  this  type  of  subsequence  shuffling  is  a  well- 
established  creative  mechanism  in  modern  choreogra¬ 
phy.  One  problem  with  any  choreographic  technique, 
automated  or  not,  that  involves  subsequence  reorder¬ 
ing,  however,  is  that  the  transitions  at  the  subsequence 
boundaries  can  be  quite  jarring.  Figure  2,  for  exam¬ 
ple,  shows  a  short  section  of  a  chaotically  generated 
variation  on  a  short  ballet  adagio.  Note  the  abrupt 
transition  between  the  fifth  and  sixth  moves  of  the 
variation. 

The interpolation algorithms that are the topic of this paper can smooth
these kinds of transitions in a manner that is both kinesiologically and
stylistically consistent. These graph-theoretic methods “learn” the
grammar of joint movements in a given corpus and then apply
memory-bounded A* search, using an influence diagram that models the
relationships of the joints in the human body in order to reduce the
otherwise-intractable search space, to find an appropriate interpolation
sequence between two given body positions. The search is complicated by
the fact that joint positions cannot be interpolated in isolation: the
movement patterns of the ankle, for instance, are strongly influenced by
whether or not the foot is on the ground, information that is implicit in
the positions of the pelvis, knees, etc. This requires that the expansion
of nodes in the search be context dependent in a somewhat unusual way.
The resulting interpolation procedures, which were developed and
evaluated in close collaboration with several expert dancers, are quite
effective at capturing and enforcing the dynamics of a given group of
movement sequences.



Figure 1: A chaotic mapping that links a short ballet adagio and the
chaotic Rössler attractor. A Voronoi diagram is used to divide the region
covered by the trajectory shown in part (a) into cells, yielding the
tiling shown in part (b). The order in which the original trajectory
traverses those cells defines the temporal order of the cell itinerary
that corresponds to that trajectory. Successive body positions from the
predefined movement sequence (c) are mapped to successive cells in that
itinerary, linking the structure of the movement sequence and the
attractor geometry. A small section of the overall mapping is shown in
part (d).
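The mapping in figure 1 can be illustrated with a deliberately simplified sketch. It is not the implementation used for the figures: the Rössler parameters, step size, sampling stride, and the choice to use one sampled trajectory point per move as a Voronoi site are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def rossler_step(state, dt=0.01, a=0.2, b=0.2, c=5.7):
    """Advance the Rossler system by one fourth-order Runge-Kutta step."""
    def deriv(p):
        x, y, z = p
        return (-y - z, x + a * y, b + z * (x - c))

    def shifted(p, k, h):
        return tuple(pi + h * ki for pi, ki in zip(p, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, dt / 2))
    k3 = deriv(shifted(state, k2, dt / 2))
    k4 = deriv(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
                 for s, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4))

def sample_trajectory(start, n_points, stride=50, burn_in=2000):
    """Return n_points (x, y) samples, skipping an initial transient."""
    state = start
    for _ in range(burn_in):
        state = rossler_step(state)
    points = []
    for _ in range(n_points):
        for _ in range(stride):
            state = rossler_step(state)
        points.append(state[:2])  # project onto the x-y plane
    return points

def chaotic_variation(moves, ref_start, new_start):
    """Shuffle `moves` by replaying them along a different trajectory."""
    # Assumption: each reference sample is a Voronoi site; site i carries moves[i].
    sites = sample_trajectory(ref_start, len(moves))
    variation = []
    for point in sample_trajectory(new_start, len(moves)):
        # The nearest site identifies the Voronoi cell the point falls in.
        cell = min(range(len(sites)), key=lambda i: math.dist(point, sites[i]))
        variation.append(moves[cell])
    return variation
```

Calling `chaotic_variation(moves, start, start)` with the same starting point for both trajectories reproduces the original order, while a nearby starting point generally visits the cells in a different order and so reorders (and may repeat or omit) moves.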


^2 Within the basin of attraction, of course.

Figure 2: Part of a variation on a short ballet sequence, generated using
the chaotic shuffling procedure diagrammed in the previous figure. Note
the abrupt transition between the fifth and sixth frames. The
interpolation schemes described in this paper can be used to smooth such
transitions in a kinesiologically and stylistically consistent fashion.


2  CORPUS-BASED  INTERPOLATION  ALGORITHMS
FOR  MOVEMENT  SEQUENCES

The interpolation schemes described in this section use corpora of human
movement (a corpus composed of ten Balanchine ballets, for instance, if
one is working with dances of that particular genre^3) to select a
movement sequence that would naturally occur between a given pair of body
postures. The basic algorithms involved are fairly straightforward, but
the application requires some unusual tactics and variations. We first
examine the corpus, capturing typical progressions of joint positions in
a set of transition graphs. Then, given a pair of body postures, we use a
variant of the A* algorithm to search these graphs for interpolation
subsequences. A typical interpolation sequence might, for instance, first
move the shoulder from its position in the fifth frame of figure 2 to its
position in the sixth frame according to the rules for shoulder movement
that are implicit in the corpus, then repeat for the elbow, and so on.

Our original approach (Bradley & Stuart 1997) was much more
coarse-grained; the atomic representational unit was a full body position
and the patterns in the corpus were represented in a single graph that
had one vertex for each observed posture. This approach was both
impractical and unsatisfying. Firstly, it did not scale well with corpus
size because the number of unique body positions is so large. Secondly,
it could only populate interpolation sequences with verbatim copies of
full-body positions that appeared in the corpus. The methods described in
this paper, on the other hand, construct the body positions in the
interpolation sequence in a joint-wise manner and on the fly. This scheme
not only avoids the storage problems of the previous approach, but also
allows innovation: it can generate sequences that contain body positions
that do not appear in the corpus.

^3 The composition of the corpus will, of course, affect the nature of
the interpolation; smoothing abrupt transitions in ballet pieces using an
interpolation scheme that is mathematically rooted in a karate corpus
will negate the very aesthetic resemblance that this approach strives to
preserve. On the other hand, this might be an interesting source of
innovation, whereby one could mathematically mix two or more styles.

2.1  BODY  POSTURE  REPRESENTATION 

We represent a human body posture by specifying the position of each of
the 23 main joints with a quaternion, a standard representation in
rigid-body mechanics that dates back to Hamilton (Goldstein 1980). A
quaternion q = (r, u) consists of an axis of rotation u and a scalar r
that specifies the angle of rotation of the joint about u. Thus, a
body-position symbol is quite complicated: 23 descriptors (pelvis,
right-wrist, etc.), 92 numbers (four for each joint), and a variety of
information about the position and orientation of the center of mass.

Joint orientations are, in reality, continuous variables, but
computational complexity requires that they be discretized in our
algorithms. Specifically, each joint λ can take on a finite number
|Q_λ| of allowed orientations^4. Formally, we define Q_λ as the set of
allowed orientations for joint λ and then replace the actual orientation
of the joint with the closest quaternion in Q_λ. We can express a body
position b as a discretized vector s by setting each of its components
s_λ equal to the quaternion in Q_λ that is closest to b_λ: s_λ = r such
that ||b_λ − r|| ≤ ||b_λ − q|| for all q ∈ Q_λ, where r ∈ Q_λ and
||x − y|| is the Euclidean distance^5 between the quaternions x and y. We
can find r in log(|Q_λ|) time using K-D trees (Friedman, Bentley, &
Finkel 1977) to represent the Q_λ sets. The procedure described in this
paragraph is analogous to “snapping” objects to a grid in computer
drawing applications.
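The snapping step can be sketched in a few lines. This is a minimal illustration only: it uses a brute-force linear scan rather than the K-D trees cited above, and the dictionary layout and function names are hypothetical.

```python
import math

def snap(orientation, allowed):
    """Return the allowed quaternion closest to `orientation`.

    Quaternions are treated as plain 4-vectors and compared by
    Euclidean distance. This linear scan costs O(|Q|) per query;
    a K-D tree would reduce that to roughly logarithmic time.
    """
    return min(allowed, key=lambda q: math.dist(orientation, q))

def discretize(body, library):
    """Snap every joint of a body position to its allowed set.

    body:    {joint_name: quaternion 4-tuple}
    library: {joint_name: list of allowed quaternion 4-tuples}
    """
    return {joint: snap(q, library[joint]) for joint, q in body.items()}
```

For example, with an allowed set `[(1, 0, 0, 0), (0, 1, 0, 0)]` for a joint, the orientation `(0.9, 0.1, 0, 0)` snaps to `(1, 0, 0, 0)`.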

Deriving a successful discretization of joint states was unexpectedly
difficult. Simply discretizing the quaternion variable values, that is,
classifying all positions between, say, (right-wrist, 1, 1, 0, 1) and
(right-wrist, 1, 1, 0.2, 1) as an equivalence class and representing
them in the algorithms as a single posture, produced visibly awkward
animations. The human visual perception system appears to be very
sensitive to small variations in quaternion coefficients: any change in a
single coefficient seems to violate the “motif” of the motion. The same
problem arose when we attempted a physically more-realistic
discretization by transforming quaternion data to Euler angles and then
discretizing θ, φ, and ψ instead. The solution on which we eventually
settled uses a discretization library that was created by hand by an
expert dancer.

^4 In practice, |Q_λ| < 400.

^5 One of the main advantages of quaternions is that they can be treated
as 4-vectors in the standard norm and transformation operations.

2.2  REPRESENTATION  OF  A 
MOVEMENT  CORPUS 

2.2.1  Joint  Transition  Graphs 

A transition graph is a weighted, directed graph that captures the
transition probabilities in a symbol sequence. In general, each vertex v
in such a graph represents a symbol and each weighted edge (v, u)
reflects the probability that the symbol associated with vertex u follows
the symbol associated with vertex v. For the purposes of analyzing a
human movement corpus, we build one transition graph for each joint,
using the corpus to identify orientations that the joint assumes and to
estimate the corresponding transition probabilities. Vertices in this
kind of graph represent particular discretized joint orientations, and
edges correspond to the movement of the joint from one orientation to
another.

The transition graph construction procedure is fairly straightforward. We
first transform every body position in the corpus to a discretized
position, as described in the previous section, so that a consecutive
pair of body positions (a, b), each consisting of 23 continuous-valued
quaternions, becomes the discretized pair (s, t), where s and t each
consist of 23 discretized quaternions. We then build a transition graph
G_λ for each joint λ; G_λ contains |Q_λ| vertices, each of which
corresponds to exactly one quaternion in Q_λ. For convenience, we will
refer to vertices in G_λ by the corresponding quaternions in Q_λ. We
record the fact that joint λ is allowed to move from a_λ to b_λ by
introducing an edge in G_λ from vertex s_λ to vertex t_λ. We assign a
weight to this edge that models the “unlikeliness” with which such a
transition occurs in the corpus. This measure of unlikeliness is related
to P(q → r), the probability that joint λ moves from the quaternion
q ∈ Q_λ to the quaternion r ∈ Q_λ, per the following expression for the
weight w_λ(q, r) of edge (q, r) ∈ G_λ:

\[ w_\lambda(q, r) = -\log(P(q \to r)) = -\log(P(r \mid q)) \approx \log(C(q)) - \log(C(q, r)) \]

where C(q) is the number of times joint λ assumed an orientation
approximated by q and C(q, r) is the number of times that the ordered
pair (q, r) occurred. Larger weights correspond to transitions that are
less likely to occur^6.
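The construction of a single joint's transition graph can be sketched as follows. This is a minimal illustration under stated assumptions: the nested-dictionary graph layout and the function name are hypothetical, and C(q) is counted only over occurrences of q that have a successor, so that the estimated probabilities leaving each vertex sum to one.

```python
import math
from collections import Counter, defaultdict

def build_transition_graph(sequences):
    """Build one joint's transition graph from discretized sequences.

    sequences: list of lists of hashable discretized orientations,
               one list per movement sequence in the corpus.
    Returns {q: {r: weight}} with weight = log C(q) - log C(q, r).
    """
    pair_counts = Counter()  # C(q, r): ordered pair (q, r) observed
    out_counts = Counter()   # C(q): q observed with some successor (assumption)
    for seq in sequences:
        for q, r in zip(seq, seq[1:]):
            pair_counts[(q, r)] += 1
            out_counts[q] += 1
    graph = defaultdict(dict)
    for (q, r), c in pair_counts.items():
        # Rare transitions get large weights; certain ones get weight 0.
        graph[q][r] = math.log(out_counts[q]) - math.log(c)
    return dict(graph)
```

Pairs that never occur simply have no edge, which matches the convention of treating disconnected vertices as joined by an edge of infinite weight.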

Figure 3 shows a transition graph for the hips that was constructed in
this fashion from a corpus of 38 short ballet sequences totaling 1720
positions. In the interests of clarity, edge weights and isolated
vertices have been omitted from this figure. The intricate patterns in
these dance progressions are reflected by the complex topology of the
graph.

2.2.2  Coordinating  Joint  Movements 

A joint transition graph represents the behavior of a joint in isolation.
This information, alone, cannot capture the physical constraints that
govern the coordination of the joints in the body. For example, if the
shoulder is in its resting position with the palm facing the thigh, the
elbow can bend nearly 180 degrees, but if the shoulder is turned 90
degrees on its long axis (until the palm faces backwards), the elbow can
only bend about five degrees before the hand collides with the leg. In
order to construct sensible interpolation sequences, we need a simple and
efficient model of this type of joint coordination.

The most complete and general approach to this problem would be to model
the interactions between each joint and every other joint in the body,
but doing so engenders a combinatorial explosion in the search space.
There are sensible ways to reduce the complexity of the problem, however;
to a first approximation, a joint is not influenced by every other joint
in the body. The position of the wrist, for instance, strongly affects
the position of the fingers but has little effect on the toes. We put
this simplifying assumption into effect by using an influence diagram
(Oliver & Smith 1990) that reflects the structure and physics of the
human body to explicitly represent the relationships of the joints to one
another. As shown in figure 4(b), the nodes (joints) in the tree only
affect the position of their immediate children. The pelvis is the root
of this tree; three branches lead from this root to nodes corresponding
to the right hip, the left hip, and the lower spine^7. Each hip joint is
the parent node to a knee, and so on. We assign a conditional probability
distribution, estimated from the corpus, to every (parent, child) pair in
the tree. For every combination of states that a parent λ and its child μ
can assume, the distributions estimate the probability that joint μ is in
orientation r given that joint λ is in orientation q, for every pair of
discretized quaternions q ∈ Q_λ, r ∈ Q_μ.

^6 Given this formulation, saying that two vertices are disconnected is
synonymous with saying that the two are connected by an edge with
infinite weight.

^7 The sacrum and the five lumbar vertebrae are lumped together. This
compromise sacrifices back suppleness for lowered complexity.

Figure 3: A transition graph that represents the movement patterns of the
hips in a small corpus of 38 short ballet sequences. The numbers in each
state identify the discretized position of the joint. Edge weights and
isolated vertices have been omitted in the interests of clarity.

Figure 4: An influence diagram that explicitly represents the
coordination of joints of the human body. Part (a) depicts the body and
part (b) shows the inter-joint dependencies induced by gravity and
topology; for instance, the position of the pelvis influences the
positions of both hips h_r and h_l and the lumbar spine l, but the right
and left ankles k_r and k_l do not directly influence one another.
Without this simplifying assumption, the search space for this problem is
intractable.
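The conditional distributions attached to the (parent, child) pairs of the influence diagram could be estimated by simple counting over the discretized corpus, along the following lines. The data layout (`frames` as dictionaries, `tree` as a child-to-parent map) and the function name are assumptions made for illustration, not a description of the original code.

```python
from collections import Counter, defaultdict

def estimate_conditionals(frames, tree):
    """Estimate P(child = r | parent = q) for every edge of the tree.

    frames: list of {joint: discretized orientation}, one per corpus frame
    tree:   {child_joint: parent_joint}, the influence diagram
    Returns {(parent, child): {q: {r: probability}}}.
    """
    pair_counts = defaultdict(Counter)    # (parent, child) -> counts of (q, r)
    parent_counts = defaultdict(Counter)  # (parent, child) -> counts of q
    for frame in frames:
        for child, parent in tree.items():
            q, r = frame[parent], frame[child]
            pair_counts[(parent, child)][(q, r)] += 1
            parent_counts[(parent, child)][q] += 1
    table = {}
    for edge, counts in pair_counts.items():
        dist = defaultdict(dict)
        for (q, r), c in counts.items():
            # Relative frequency of r among frames where the parent is in q.
            dist[q][r] = c / parent_counts[edge][q]
        table[edge] = dict(dist)
    return table
```

Combinations that never co-occur in the corpus are absent from the table, which amounts to an estimated probability of zero.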

2.3  A  JOINT-WISE  INTERPOLATION
ALGORITHM

Given a pair of discretized body postures (s, t) and a set of 23
transition graphs (one for each joint), we can use a memory-bounded A*
search strategy (Winston 1992) to find an interpolation subsequence that
moves smoothly between s and t. In general, A* finds a path from an
initial state to a goal state by progressively generating successors of
the current state in the search. The algorithm places successor states on
a priority queue, sorted according to a score that estimates the cost of
finding a goal state. In the next iteration, the state with the best
score is drawn from the priority queue, its successor states are computed
and added to the queue, and the procedure is repeated until a goal state
is found or until the queue is empty.

In this application, the states in the A* search space are body states:
23-vectors of discretized quaternions that represent full body positions.
To generate successors of a body state s, we first use the transition
graphs to find successors for each joint state s_λ independently, and
then take all combinations (cross product) across the joints to obtain
the list of body-state successors. From this list, we can filter out the
disallowed body positions using the influence diagram and the probability
distribution of parent-child pairs. The successors of the joint-state s_λ
are those vertices in G_λ that are connected to s_λ by an edge directed
away from s_λ.
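Successor generation can be sketched as a cross product followed by a filter. The sketch below assumes a particular filtering rule, namely that a body position is disallowed when some (parent, child) combination has estimated probability zero; the text does not state the exact criterion, and the data layouts and names are hypothetical.

```python
from itertools import product

def body_successors(state, joints, graphs, tree, cond):
    """List the allowed successors of a body state.

    state:  tuple of discretized orientations, ordered as `joints`
    graphs: {joint: {q: {r: weight}}}, one transition graph per joint
    tree:   {child_joint: parent_joint}, the influence diagram
    cond:   {(parent, child): {q: {r: probability}}}
    """
    # Per-joint successors: vertices reached by an edge leaving the current one.
    per_joint = [list(graphs[j].get(q, {})) for j, q in zip(joints, state)]
    allowed = []
    for candidate in product(*per_joint):
        pose = dict(zip(joints, candidate))
        # Assumed filter: every parent-child combination must have been
        # observed in the corpus (nonzero estimated probability).
        consistent = all(
            cond[(parent, child)].get(pose[parent], {}).get(pose[child], 0.0) > 0.0
            for child, parent in tree.items()
        )
        if consistent:
            allowed.append(candidate)
    return allowed
```

Because the cross product grows multiplicatively with the number of joints, filtering with the influence diagram is what keeps the successor list manageable in practice.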

The score assigned to a body state u has two parts:

1.  the cost of the path from the initial state s to u

2.  an estimate of the distance between u and the goal state t

The cost of the path starting at s and ending at u is simply the sum of
the costs of the transitions taken in the path. Furthermore, since each
body movement is composed of a group of joint movements, we can compute
the cost of one body-state transition by summing the weights over the
edges traversed by the joints. To make this concrete, suppose we are
trying to find an interpolating path between the body states s and t. At
some point in the search, we reach the body-state u and must assign the
path from s to u a score. If we write the path from s to u as
s ⇝ u = (x^0 = s, x^1, ..., x^{l-1}, x^l = u), we can express the cost of
such a path as

\[ g(s \leadsto u) = \sum_{i=1}^{l} \sum_{\lambda} w_\lambda(x^{i-1}_\lambda, x^{i}_\lambda) \]

where w_λ(q, r) denotes the weight of edge (q, r) in G_λ.

The heuristic part of the score, h(u), estimates how far u is from the
goal state t. h(u) is calculated by summing the weights of the shortest
paths from u_λ to t_λ, u_λ, t_λ ∈ G_λ, over all the joints. We obtain
these shortest path weights using Dijkstra’s single-source shortest path
algorithm (Dijkstra 1959), implemented as described in (Cormen,
Leiserson, & Rivest 1990). The final score assigned to body-state u is
then f(s ⇝ u) = g(s ⇝ u) + h(u).

At the time of this writing, we have only done extensive testing on a
greedy search strategy that ignores the cost of paths and scores nodes in
the search based solely on the estimated distance between them and the
goal (i.e., f(s ⇝ u) = h(u)). In the following section, we describe the
implications of this strategy and suggest how different A* scoring
functions are likely to affect the interpolation sequences. We are also
working on incorporating more information about the position, velocity,
and acceleration of the center of mass, so the momentum of the body is
conserved as it passes through the interpolated sections of the movement;
accomplishing this will require wide-ranging adaptations to the basic A*
algorithm and perhaps even a wholly different approach. Finally, we are
also in the process of testing how different influence diagram topologies
affect the interpolation algorithm’s ability to select good postures
during the search. (For example, to model and enforce the symmetry of the
body, we could combine left and right counterparts into one node.)

3  RESULTS  AND  EVALUATION

The “goal” of choreography is aesthetic appeal, so it is difficult to
analyze the results of this work using standard scientific methods^8.
However, there are some standard rules, procedures, and patterns in
certain dance and martial arts genres that can be used to evaluate the
interpolation sequences generated by the corpus-based techniques
described in the previous sections. The evaluation described in this
section is a highly condensed transcript of a dozen one- to two-hour
sessions, wherein expert dancers (primarily Professor David Capps of the
Department of Theater and Dance at the University of Colorado, an
accomplished dancer and choreographer whose works have appeared on stages
around the world, and Nadia Rojasadame, a student in that department and
the composer of the adagio used to generate the variations shown in
figures 1 and 2) went through the results frame by frame, answering and
then discussing the following questions:

•  Does  this  posture  transition  look  reasonable? 

•  If  so,  why  and  how? 

•  If  not,  why  and  how?  What  would  you  do  instead? 
How  many  poses  would  you  assume  in  doing  so? 


In order to make this process less subjective, we are developing a formal
evaluation protocol, consisting of several subsequences and a series of
scored questions about the flow of the movement therein, to be
administered to groups of University of Colorado dance students.


Figure 5 shows a movement sequence that the learning and search
algorithms described in the previous sections produced when given the
task of interpolating between the fifth and sixth frames of the ballet
sequence in figure 2. The search strategy was a simple greedy approach,
an A* score f(s ⇝ u) = h(u) that only factored in the distance to the
goal, and the corpus included 38 short ballets. The starting and ending
body postures (top left and top right in figure 5, labeled […] and 10,
respectively) are quite different; note the facing of the dancer and the
weight distribution on the feet, for example. The eight-move
interpolation sequence computed by the interpolator moves between those
positions in a very natural way. Its first move, for instance, is to
lower the left leg, a natural strategy if one is going to change one’s
facing and end up on two feet. The following move is a simple weight
shift (frames […] and […]), in preparation for a lift of the right leg.
This lift, which is not strictly necessary to move from the fifth frame
to the tenth, is an innovation that the program inserted because of the
observed patterns in the corpus; it reflects the fact that ballet dancers
rarely spin with both feet flat on the ground. Perhaps the most
interesting thing about this interpolation sequence, from a balletic
standpoint, is the relevé^9 that the interpolation procedure inserted
between frames […] and […]. Many relevés appear in the corpus, but none
of them are associated with upper body positions that resemble the one
that appears in this sequence. Our algorithm has invented a physically
and stylistically appropriate way to move the dancer between the
specified positions. The interpolation sequence in figure 5 includes a
variety of other stylistically consistent innovations as well; consider,
for example, the uplifted chest and chin in frames […] and […], posture
elements that are quintessential ballet style. Recall that these postures
were not simply pasted in verbatim from the corpus; they were synthesized
joint by joint using the transition graphs and influence-diagram directed
A* search, and their fit to the genre is strong evidence of the success
of the methods described in the previous section.

^8 The very notion of objective, quantifiable evaluation elicited much
consternation and mirth, along with some offense, from our expert dance
consultants.

Figure 5: An interpolation sequence computed by the corpus-based
techniques described in the previous sections. The starting and ending
positions passed as input to the interpolation procedure are shown at the
top left and top right, respectively; the eight frames below them were
computed by the interpolator.


The original ballet sequence from which the snapshot in figure 1 was
drawn contained 68 frames, and the chaotic shuffling scheme introduced 23
abrupt transitions into the variation (e.g., frames 5 and 6 of figure 2).
In eleven of those 23 cases, including the one depicted in figure 5, our
interpolation scheme was successful in interpolating smoothly between the
two moves that framed the gap. The interpolation subsequences so
constructed, which ranged in length from two to 60 frames, included a
variety of stylistically consistent and often innovative sequences; among
other things, the interpolation algorithms used relevés, pliés and
fifth-position rests in highly appropriate ways, and all with no hard
coding. From a subjective artistic standpoint, the results have some room
for improvement; there are still five somewhat-awkward transitions in the
185 total frames of the 11 interpolation sequences. A less-subjective way
to evaluate the success of this scheme is to compare the length of these
interpolation sequences to the distance between the corresponding
postures in the original piece, which is presumably a good metric for how
long it would take a human to move from one to the other. For the most
part, the interpolated sequences were shorter than or the same length^10
as the number of frames separating the corresponding positions in the
original piece, which indicates that the search strategies are working
well. MPEG movies of this adagio sequence and its chaotic variation, both
with and without interpolation, are available on the web^11.

This example brings out two significant failure modes of this approach.
First, the algorithms cannot find interpolation subsequences between body
positions that occur in reversed temporal order, e.g., places where the
chaotic shuffler has forced a jump backwards in time, inserting a move
into the variation that appeared earlier in the original piece. Second,
the algorithms sometimes introduce relatively long paths between
positions that appear very similar; in one such instance, where the task
was a simple 90-degree rotation of the right shoulder around the long
axis of the arm, the algorithm constructed a 65-move sequence that
involved much leg and trunk movement. Both of these problems are the
result of limited corpus size and corresponding patterns in the joint
transition graphs. These graphs are far from being connected, so some
joint orientations are not reachable from others. Even when they are
connected, the search may have to wander all over the graph to find a
path between two given vertices. In a large, rich corpus, the graphs
would be highly connected, giving the search algorithms more leeway. In
the existing corpora, however, the paucity of edges constrains them to
very narrow (and long) search paths that can translate to stilted,
idiosyncratic movement sequences. This is an unavoidable problem in this
application, unfortunately; the dance world has not yet embraced the
notion of computer animation, so the availability of animated dances is
quite limited.

^9 A relevé, which consists of lifting up on one’s toes, is a
stylistically required component of a direction shift in ballet.

^10 Five were shorter (77% average), four were the same length, and two
were somewhat (150% and 110%) longer.

^11 www.cs.colorado.edu/~lizb/chaotic-dance.html

Long, linear vertex chains like the ones at the top left of figure 3 are
introduced into the joint transition graph when one animation in the
corpus progresses through orientations that do not occur in other
animations. The directionality in these chains makes it impossible for
the search to move “upstream,” which is the cause of the first failure
mode described in the previous paragraph. We could fix this problem,
artificially, by introducing reverse edges into the graphs in some
kinesiologically and stylistically justifiable way. For every transition
s → t seen in the corpus, for example, we could introduce an edge from
t_λ to s_λ for every joint λ. The implicit assumption here is that it is
always possible to reverse the motion of a joint^12. Thus, at the expense
of destroying some of the accuracy with which the original approach
modeled the temporal asymmetry of the genre, we could force the graphs to
be connected. We are currently investigating what probabilities to use on
these reverse edges. Artificially introduced reverse transitions would
not solve the second problem; chains, even bidirectional ones, tend to
lengthen interpolated sequences. One solution to this problem is to add
more examples to the corpus to enrich its connectivity. If more examples
are hard to come by, another (artificial) solution is to perform a
coarser discretization to minimize the number of possible states a joint
can assume. We are currently experimenting with different discretization
resolutions to simultaneously minimize the number of nodes and maximize
the statistical information content of the transition graphs.
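The proposed reverse-edge repair can be illustrated with a small sketch. Since the appropriate probabilities for reverse edges are still under investigation, the single `reverse_weight` parameter below is purely a placeholder assumption, as are the function names and the nested-dictionary graph layout.

```python
def add_reverse_edges(graph, reverse_weight):
    """Return a copy of `graph` in which every edge q -> r also has r -> q.

    graph: {q: {r: weight}}. Existing edges keep their weights; only
    missing reverse edges receive the placeholder `reverse_weight`.
    """
    out = {q: dict(nbrs) for q, nbrs in graph.items()}
    for q, nbrs in graph.items():
        for r in nbrs:
            out.setdefault(r, {}).setdefault(q, reverse_weight)
    return out

def reachable(graph, start):
    """Set of vertices reachable from `start` along directed edges."""
    seen, stack = {start}, [start]
    while stack:
        q = stack.pop()
        for r in graph.get(q, {}):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen
```

On a one-way chain such as `{1: {2: 0.5}, 2: {3: 0.5}}`, nothing is reachable from vertex 3 except itself; after `add_reverse_edges`, all three vertices are mutually reachable, which is the "upstream" movement that the first failure mode requires.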

^^This makes sense for classical ballet, but not modern dance; motion in the former tends to be “circular” in space, whereas in the latter, one often moves a limb out and back along the same path.

The greedy A* search strategy is reflected by “inefficiencies” in the interpolation sequences: places where the dancer appears to be headed towards the goal state, but then moves away. For example, one of the interpolation goals in figure 5 is to change the facing almost 180 degrees, from left to right. By the fourth frame, the dancer has turned to the right, but in the fifth frame s/he has turned back to the left again, which is part of what necessitates the relevé sequence between frames and pf]. We are in the process of testing different search strategies and analyzing the results; instead of choosing the state that is closest to the goal, for instance, we are incorporating the path weights up to the current point in the solution as part of the scoring function. This should allow the search algorithm to find shorter, more-direct sequences. Finally, note that some search strategies (e.g., always taking the highest-probability branch) can be a significant source of cliché.
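The difference between scoring candidates only by closeness to the goal and also counting the path weight accumulated so far can be sketched as follows. This is a minimal illustration on a hypothetical toy graph, not the implementation used in this work; the node names, edge weights, and heuristic values in `GRAPH` and `H` are invented for the example.

```python
import heapq

def best_first(graph, h, start, goal, use_path_cost):
    """Best-first search returning (path, total edge weight).

    use_path_cost=False orders the frontier by the heuristic h alone (greedy);
    use_path_cost=True orders it by accumulated weight g plus h (A*-style).
    """
    frontier = [(h[start], 0.0, start, [start])]
    best_g = {}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, g
        if node in best_g and best_g[node] <= g:
            continue  # already expanded via a path that was at least as cheap
        best_g[node] = g
        for nxt, w in graph.get(node, {}).items():
            g2 = g + w
            score = (g2 if use_path_cost else 0.0) + h[nxt]
            heapq.heappush(frontier, (score, g2, nxt, path + [nxt]))
    return None, float("inf")

# Hypothetical graph: node A looks closer to the goal G than B does,
# but the route through A is much longer in total weight.
GRAPH = {"S": {"A": 1, "B": 1}, "A": {"C": 5}, "C": {"G": 5}, "B": {"G": 1}}
H = {"S": 2.0, "A": 0.5, "B": 1.0, "C": 0.2, "G": 0.0}

greedy_path, greedy_cost = best_first(GRAPH, H, "S", "G", use_path_cost=False)
astar_path, astar_cost = best_first(GRAPH, H, "S", "G", use_path_cost=True)
```

On this toy graph the greedy ordering commits to the node that merely looks closest, while the ordering that includes the accumulated weight reaches the goal by the shorter route.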

In order to explore the effects of joint coordination, we removed the influence diagram and ran simple, uncoordinated A* search to find paths between positions. The resulting sequences were extremely interesting. To the layman’s eye, they look jerky and unappealing, so we expected negative comments about them from the experts. However, it seems that an uncoordinated path through a classical ballet corpus is a very good way to generate modern dance sequences, and the results were inventive and appealing: “Wow! I’m going to use that move in my next piece!” In retrospect, this makes some sense: the modern dance genre actively works at violating the ballet motif.

The interpolation procedure is fairly rapid. Applying greedy search to the 23 abrupt-transition pairs in the 68-frame variation, for instance, required^^ 280 seconds on an HP9000/735 workstation running HP-UX v10.20 for a corpus containing 1720 ballet postures. A more-complex scoring function will obviously require longer run time. Preliminary runs of non-greedy A*, for example, required 500 seconds to perform the same task and yielded similar results, in terms of quality, sequence length, etc. The complexity also increases with corpus size; the same (non-greedy A*) task on an augmented corpus of 5000 postures (the 1720 original frames plus 3280 non-ballet sequences) required 3620 seconds. The chaotic shuffling procedure is also fast: for a 1000-position movement sequence, the chaotic shuffling procedure required 18 seconds on the same workstation, while a 9000-move sequence required 156 seconds.

^^This will obviously depend on the positions involved.

4 CONCLUSION

By applying techniques from graph theory, artificial intelligence, and statistics to a corpus of movement sequences from a particular genre, the interpolation methods described in this paper automatically construct interpolation sequences that move from one specified body posture to another in a physically and stylistically coherent fashion. These tactics can be used to smooth abrupt transitions that result from subsequence reordering, a common creative mechanism in modern choreography that can be emulated mathematically by using chaotic dynamics to generate variations.

Evaluating  the  results  of  this  work  is  necessarily  some¬ 
what  subjective.  We  have  shown  animations  of  a  vari¬ 
ety  of  different  chaotic  variations  to  hundreds  of  peo¬ 
ple,  including  dozens  of  dancers  and  martial  artists, 
both  with  and  without  smoothing  of  the  abrupt  tran¬ 
sitions.  We  have  also  worked  in  depth  with  several  ex¬ 
pert  dancers  in  order  to  evaluate  those  interpolation 
sequences  sensibly.  The  consensus  is  that  the  chaotic 
variations  with  smoothed  transitions  not  only  resemble 
the  original  pieces,  but  also  are  in  some  sense  pleasing 
to  the  eye.  They  are  both  different  from  the  origi¬ 
nals  and  faithful  to  the  dynamics  of  the  genre;  there 
are  no  jarring  transitions  or  out-of-character  moves. 
This  is  a  non-trivial  accomplishment.  A  previous  at¬ 
tempt  to  use  mathematics  to  generate  choreographic 
variations  —  a  subsequence  randomization  scheme  in¬ 
troduced  by  the  now  well-known  choreographer  Merce 
Cunningham  in  the  1960s  —  met  with  a  strongly  neg¬ 
ative  reception  in  the  dance  world,  primarily  because 
of  the  awkwardness  at  the  transition  points^^. 

Many  of  the  techniques  used  here,  as  well  as  others  on 
which  we  are  currently  working,  were  inspired  by  solu¬ 
tions  to  similar  problems  that  arise  in  computational 
linguistics  (e.g.,  learning  a  grammar  from  a  corpus  and 
then  using  it  to  construct  meaningful  sentences).  For 
example,  one  can  view  the  transition  graphs  in  sec¬ 
tion  2.2.1  and  figure  3  as  first-order  Markov  chains, 
where  a  single  chain  represents  the  probabilistic  be¬ 
havior  of  each  joint  in  the  body. 
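The Markov-chain view of a single joint can be sketched in a few lines. This is a minimal illustration, assuming each joint's motion has already been discretized into a sequence of symbolic states; the function name and the toy symbols are hypothetical and are not taken from the implementation described in this paper.

```python
from collections import Counter, defaultdict

def transition_probabilities(sequences):
    """Estimate first-order transition probabilities for one joint.

    `sequences` is an iterable of state sequences (one per corpus piece).
    Returns {state: {next_state: probability}} from relative frequencies.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return {
        s: {t: n / sum(c.values()) for t, n in c.items()}
        for s, c in counts.items()
    }

# Toy corpus of two short sequences over the symbols "a", "b", "c".
chain = transition_probabilities([["a", "b", "a", "c"], ["a", "b"]])
```

A state that never appears as a predecessor (here "c") gets no outgoing edges, which is one way the connectivity problems discussed earlier can show up in a small corpus.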

The  objective  of  this  research  project  was  to  tailor 
generic  strategies  for  a  specific  high-dimensional  search 
problem  in  an  unusual  and  demanding  domain.  The 
results  could  be  extended  to  other  domains  where  the 
genre  of  sequence  is  important,  such  as  speech  recog¬ 
nition  (e.g.,  filling  in  missing  parts  of  a  signal)  or  text. 
Finally,  the  implementation  of  these  algorithms  allows 
for  arbitrary  body  topologies,  so  we  are  by  no  means 
limited  to  human  motion  sequences  —  though  one 
would,  of  course,  have  to  adapt  the  quaternion-based 
symbol  set  and  the  influence  diagram  to  the  topology 
of  the  limbs  and  joints  that  are  involved. 


^^“Since that time, aleatory choreography, wherein randomization schemes are used to shuffle sequences, has by now become one of the important currencies of dance composition approaches.” (Capps 1998).

Abstract

Several  researchers  have  proposed  modeling 
temporally  abstract  actions  in  reinforcement 
learning  by  the  combination  of  a  policy  and  a  ter¬ 
mination  condition,  which  we  refer  to  as  an  op¬ 
tion.  Value  functions  over  options  and  models  of 
options  can  be  learned  using  methods  designed 
for  semi-Markov  decision  processes  (SMDPs). 
However,  all  these  methods  require  an  option  to 
be  executed  to  termination.  In  this  paper  we  ex¬ 
plore  methods  that  learn  about  an  option  from 
small  fragments  of  experience  consistent  with 
that  option,  even  if  the  option  itself  is  not  exe¬ 
cuted.  We  call  these  methods  intra-option  learn¬ 
ing  methods  because  they  learn  from  experience 
within  an  option.  Intra-option  methods  are  some¬ 
times  much  more  efficient  than  SMDP  meth¬ 
ods  because  they  can  use  off-policy  temporal- 
difference  mechanisms  to  learn  simultaneously 
about  all  the  options  consistent  with  an  expe¬ 
rience,  not  just  the  few  that  were  actually  exe¬ 
cuted.  In  this  paper  we  present  intra-option  learn¬ 
ing  methods  for  learning  value  functions  over  op¬ 
tions  and  for  learning  multi-time  models  of  the 
consequences  of  options.  We  present  compu¬ 
tational  examples  in  which  these  new  methods 
learn  much  faster  than  SMDP  methods  and  learn 
effectively  when  SMDP  methods  cannot  learn  at 
all.  We  also  sketch  a  convergence  proof  for  intra¬ 
option  value  learning. 


1  Introduction 

Learning, planning, and representing knowledge at multiple levels of temporal abstraction remain key challenges for AI. Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (MDPs) (e.g., Singh, 1992a,b; Kaelbling, 1993; Lin, 1993; Dayan & Hinton, 1993; Thrun and Schwartz, 1995; Sutton, 1995; Huber and Grupen, 1997; Kalmár, Szepesvári, and Lőrincz, 1997; Dietterich, 1998; Parr and Russell, 1998; Precup, Sutton, and Singh 1997, 1998a,b). This framework is appealing because of its general goal formulation, applicability to stochastic environments, and ability to use sample or simulation models (e.g., see Sutton and Barto, 1998). Extensions of MDPs to semi-Markov decision processes (SMDPs) provide a way to model temporally abstract actions, as we summarize in Sections 3 and 4 below. Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option. Options are a flexible way of representing temporally extended courses of action such that they can be used interchangeably with primitive actions in existing learning and planning methods (Sutton, Precup, and Singh, in preparation).

In  this  paper  we  explore  ways  for  learning  about  options 
using  a  class  of  off-policy,  temporal-difference  methods 
that  we  call  intra-option  learning  methods.  Intra-option 
methods  look  inside  options  to  learn  about  them  even 
when  only  a  single  action  is  taken  that  is  consistent  with 
them.  Whereas  SMDP  methods  treat  options  as  indivisi¬ 
ble  black  boxes,  intra-option  methods  attempt  to  take  ad¬ 
vantage  of  their  internal  structure  to  speed  learning.  Intra¬ 
option  methods  were  introduced  by  Sutton  (1995),  but  only 
for  a  pure  prediction  case,  with  a  single  policy. 

The structure of this paper is as follows. First we introduce the basic notation of reinforcement learning, options and models of options. In Section 4 we briefly review SMDP methods for learning value functions over options and thus how to select among options. Our new results are in Sections 5-7. Section 5 introduces an intra-option method for learning value functions and sketches a proof of its convergence. Computational experiments comparing it with SMDP methods are presented in Section 6. Section 7 concerns methods for learning models of options, as are used in planning: we introduce an intra-option method and illustrate its advantages in computational experiments.


2 Reinforcement Learning (MDP) Framework

In the reinforcement learning framework, a learning agent interacts with an environment at some discrete, lowest-level time scale $t = 0, 1, 2, \ldots$. At each time step, the agent perceives the state of the environment, $s_t \in \mathcal{S}$, and on that basis chooses a primitive action, $a_t \in \mathcal{A}_{s_t}$. In response to $a_t$, the environment produces one step later a numerical reward, $r_{t+1} \in \Re$, and a next state, $s_{t+1}$. We denote the union of the action sets by $\mathcal{A} = \bigcup_{s \in \mathcal{S}} \mathcal{A}_s$. If $\mathcal{S}$ and $\mathcal{A}$ are finite, then the environment’s transition dynamics are modeled by one-step state-transition probabilities, and one-step expected rewards,

$$p^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\} \quad\text{and}\quad r^a_s = E\{r_{t+1} \mid s_t = s, a_t = a\},$$

for all $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$ (it is understood here that $p^a_{ss'} = 0$ for $a \notin \mathcal{A}_s$). These two sets of quantities together constitute the one-step model of the environment.

The agent’s objective is to learn a policy $\pi$, which is a mapping from states to probabilities of taking each action, that maximizes the expected discounted future reward from each state $s$:

$$V^\pi(s) = E\{r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid s_t = s, \pi\},$$

where $\gamma \in [0, 1)$ is a discount-rate parameter. The quantity $V^\pi(s)$ is called the value of state $s$ under policy $\pi$, and $V^\pi$ is called the value function for policy $\pi$. The optimal value of a state is denoted

$$V^*(s) = \max_\pi V^\pi(s).$$

Particularly important for learning methods is a parallel set of value functions for state-action pairs rather than for states. The value of taking action $a$ in state $s$ under policy $\pi$, denoted $Q^\pi(s, a)$, is the expected discounted future reward starting in $s$, taking $a$, and henceforth following $\pi$:

$$Q^\pi(s, a) = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi\}.$$

This is known as the action-value function for policy $\pi$. The optimal action-value function is

$$Q^*(s, a) = \max_\pi Q^\pi(s, a).$$

The action value functions satisfy the Bellman equations:

$$Q^\pi(s, a) = r^a_s + \gamma \sum_{s'} p^a_{ss'} \sum_{a'} \pi(s', a') Q^\pi(s', a') \tag{1}$$

$$Q^*(s, a) = r^a_s + \gamma \sum_{s'} p^a_{ss'} \max_{a'} Q^*(s', a') \tag{2}$$
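As a concrete illustration of the Bellman equation for the optimal action-value function, the sketch below repeatedly applies it as an update on a tiny, hypothetical MDP. The dictionary layout for the one-step model (`P` for transition probabilities, `R` for expected rewards), the state and action names, and the number of sweeps are all assumptions made for this example; this is ordinary value iteration on action values, shown only to make the notation concrete.

```python
def q_value_iteration(P, R, gamma, sweeps=200):
    """Iterate Q(s,a) <- r_s^a + gamma * sum_s' p_ss'^a * max_a' Q(s',a').

    P[s][a] is a dict {next_state: probability}; R[s][a] is the expected
    one-step reward. Both are the one-step model of a finite MDP.
    """
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(sweeps):
        Q = {
            s: {
                a: R[s][a]
                + gamma * sum(p * max(Q[s2].values()) for s2, p in P[s][a].items())
                for a in P[s]
            }
            for s in P
        }
    return Q

# Hypothetical two-state MDP: from "x" the agent can "stay" (reward 0)
# or "go" to an absorbing, reward-free state "end" (reward 1).
P = {"x": {"stay": {"x": 1.0}, "go": {"end": 1.0}}, "end": {"noop": {"end": 1.0}}}
R = {"x": {"stay": 0.0, "go": 1.0}, "end": {"noop": 0.0}}
Q_star = q_value_iteration(P, R, gamma=0.9)
```

Here staying is worth exactly one discount factor less than going, because the best the agent can do after staying is to go on the next step.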


3  Options 

We use the term options for our generalization of primitive actions to include temporally extended courses of action. In this paper, we focus on Markov options, which consist of three components: a policy $\pi : \mathcal{S} \times \mathcal{A} \to [0, 1]$, a termination condition $\beta : \mathcal{S} \to [0, 1]$, and an input set $\mathcal{I} \subseteq \mathcal{S}$. An option $(\mathcal{I}, \pi, \beta)$ is available in state $s$ if and only if $s \in \mathcal{I}$. If the option is taken, then actions are selected according to $\pi$ until the option terminates stochastically according to $\beta$. In particular, if the option taken in state $s_t$ is Markov, then the next action $a_t$ is selected according to the probability distribution $\pi(s_t, \cdot)$. The environment then makes a transition to state $s_{t+1}$, where the option either terminates, with probability $\beta(s_{t+1})$, or else continues, determining $a_{t+1}$ according to $\pi(s_{t+1}, \cdot)$, possibly terminating in $s_{t+2}$ according to $\beta(s_{t+2})$, and so on. When the option terminates, then the agent has the opportunity to select another option.

The input set and termination condition of an option together restrict its range of application in a potentially useful way. In particular, they limit the range over which the option’s policy needs to be defined. For example, a handcrafted policy $\pi$ for a mobile robot to dock with its battery charger might be defined only for states $\mathcal{I}$ in which the battery charger is within sight. The termination condition $\beta$ would be defined to be 1 outside of $\mathcal{I}$ and when the robot is successfully docked. For Markov options it is natural to assume that all states where an option might continue are also states where the option might be taken (i.e., that $\{s : \beta(s) < 1\} \subseteq \mathcal{I}$). In this case, $\pi$ needs to be defined only over $\mathcal{I}$ rather than over all of $\mathcal{S}$.

Given a set of options, their input sets implicitly define a set of available options $\mathcal{O}_s$ for each state $s \in \mathcal{S}$. The sets $\mathcal{O}_s$ are much like the sets of available actions, $\mathcal{A}_s$. We can unify these two kinds of sets by noting that actions can be considered a special case of options. Each action $a$ corresponds to an option that is available whenever $a$ is available ($\mathcal{I} = \{s : a \in \mathcal{A}_s\}$), that always lasts exactly one step ($\beta(s) = 1, \forall s \in \mathcal{S}$), and that selects $a$ everywhere ($\pi(s, a) = 1, \forall s \in \mathcal{I}$). Thus, we can consider the agent’s choice at each time to be entirely among options, some of which persist for a single time step, others which are more temporally extended. We refer to the former as one-step or primitive options and the latter as multi-step options.

We now consider Markov policies over options, $\mu : \mathcal{S} \times \mathcal{O} \to [0, 1]$, and their value functions. When initiated in a state $s_t$, such a policy $\mu$ selects an option $o \in \mathcal{O}_{s_t}$ according to probability distribution $\mu(s_t, \cdot)$. The option $o$ is taken in $s_t$, determining actions until it terminates in $s_{t+k}$, at which point a new option is selected, according to $\mu(s_{t+k}, \cdot)$, and so on. In this way a policy over options, $\mu$, determines a policy over actions, or flat policy, $\pi = f(\mu)$. Henceforth we use the unqualified term policy for Markov policies over options, which include Markov flat policies as a special case.
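The three components of a Markov option, and the way a primitive action fits the same mold, can be sketched as a small data structure. This is a minimal illustration rather than code from this paper: the class and function names, the callable-based representation of $\pi$ and $\beta$, and the toy corridor environment (`corridor_step`) are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Hashable

State = Hashable
Action = Hashable

@dataclass(frozen=True)
class Option:
    """A Markov option (I, pi, beta)."""
    input_set: FrozenSet[State]                      # I: states where it may start
    policy: Callable[[State], Dict[Action, float]]   # pi(s, .): action probabilities
    termination: Callable[[State], float]            # beta(s): chance of stopping in s

    def available(self, s: State) -> bool:
        return s in self.input_set

def primitive_option(a: Action, states) -> Option:
    """Wrap primitive action `a` as a one-step option (beta = 1 everywhere)."""
    return Option(frozenset(states), lambda s: {a: 1.0}, lambda s: 1.0)

def run_option(option, s, step, rng, gamma=1.0):
    """Follow `option` from `s` until it terminates.

    `step(s, a)` is a hypothetical environment function returning
    (reward, next_state). Returns (discounted reward, final state, steps k).
    """
    if not option.available(s):
        raise ValueError("option is not available in this state")
    total, discount, k = 0.0, 1.0, 0
    while True:
        probs = option.policy(s)
        a = rng.choices(list(probs), weights=list(probs.values()))[0]
        r, s = step(s, a)
        total += discount * r
        discount *= gamma
        k += 1
        if rng.random() < option.termination(s):
            return total, s, k

# Toy corridor 0-1-2-3 with a reward of -1 per move.
def corridor_step(s, a):
    return -1.0, s + (1 if a == "right" else -1)

# Multi-step option: keep moving right, terminate only on reaching cell 3.
go_to_end = Option(frozenset({0, 1, 2}),
                   lambda s: {"right": 1.0},
                   lambda s: 1.0 if s == 3 else 0.0)
right = primitive_option("right", {0, 1, 2})
```

From the agent's point of view, `go_to_end` and `right` are chosen in exactly the same way; they differ only in how long they retain control.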

Note, however, that $f(\mu)$ is typically not Markov because the action taken in a state depends on which option is being taken at the time, not just on the state. We define the value of a state $s$ under a general flat policy $\pi$ as the expected return if the policy is started in $s$:

$$V^\pi(s) = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid \mathcal{E}(\pi, s, t)\},$$

where $\mathcal{E}(\pi, s, t)$ denotes the event of $\pi$ being initiated in $s$ at time $t$. The value of a state under a general policy (i.e., a policy over options) $\mu$ can then be defined as the value of the state under the corresponding flat policy: $V^\mu(s) = V^{f(\mu)}(s)$.

It is natural to also generalize the action-value function to an option-value function. We define $Q^\mu(s, o)$, the value of taking option $o$ in state $s \in \mathcal{I}$ under policy $\mu$, as

$$Q^\mu(s, o) = E\{r_{t+1} + \gamma r_{t+2} + \cdots \mid \mathcal{E}(o\mu, s, t)\},$$

where $o\mu$, the composition of $o$ and $\mu$, denotes the policy that first follows $o$ until it terminates and then initiates $\mu$ in the resultant state.

Options are closely related to the actions in a special kind of decision problem known as a semi-Markov decision process, or SMDP (e.g., see Puterman, 1994). Any fixed set of options for a given MDP defines a new SMDP overlaid on the MDP. The appropriate form of model for options, analogous to the $r^a_s$ and $p^a_{ss'}$ defined earlier for actions, is known from existing SMDP theory. For each state in which an option may be started, this kind of model predicts the state in which the option will terminate and the total reward received along the way. These quantities are discounted in a particular way. For any option $o$, let $\mathcal{E}(o, s, t)$ denote the event of $o$ being taken in state $s$ at time $t$. Then the reward part of the model of $o$ for state $s$ is

$$r^o_s = E\{r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid \mathcal{E}(o, s, t)\}, \tag{3}$$

where $t + k$ is the random time at which $o$ terminates. The state-prediction part of the model of $o$ for state $s$ is

$$p^o_{ss'} = \sum_{j=1}^{\infty} \gamma^j \Pr\{s_{t+k} = s', k = j \mid \mathcal{E}(o, s, t)\} = E\{\gamma^k \delta_{s' s_{t+k}} \mid \mathcal{E}(o, s, t)\}, \tag{4}$$

for all $s' \in \mathcal{S}$, under the same conditions, where $\delta_{ss'}$ is an identity indicator, equal to 1 if $s = s'$, and equal to 0 else. Thus, $p^o_{ss'}$ is a combination of the likelihood that $s'$ is the state in which $o$ terminates together with a measure of how delayed that outcome is relative to $\gamma$. We call this kind of model a multi-time model because it describes the outcome of an option not at a single time but at potentially many different times, appropriately combined.

4 SMDP Learning Methods

Using multi-time models of options we can write Bellman equations for general policies and options. For example, the Bellman equation for the value of option $o$ in state $s \in \mathcal{I}$ under a Markov policy $\mu$ is

$$Q^\mu(s, o) = r^o_s + \sum_{s'} p^o_{ss'} \sum_{o' \in \mathcal{O}_{s'}} \mu(s', o') Q^\mu(s', o'). \tag{5}$$

The optimal value functions and optimal Bellman equations can also be generalized to options and to policies over options. Of course, the conventional optimal value functions $V^*$ and $Q^*$ are not affected by the introduction of options; one can ultimately do just as well with primitive actions as one can with options. Nevertheless, it is interesting to know how well one can do with a restricted set of options that does not include all the actions. For example, one might first consider only high-level options in order to find an approximate solution quickly. Let us denote the restricted set of options by $\mathcal{O}$ and the set of all policies that select only from $\mathcal{O}$ by $\Pi(\mathcal{O})$. Then the optimal value function given that we can select only from $\mathcal{O}$ is

$$V^*_{\mathcal{O}}(s) = \max_{\mu \in \Pi(\mathcal{O})} V^\mu(s) \tag{6}$$

$$\phantom{V^*_{\mathcal{O}}(s)} = \max_{o \in \mathcal{O}_s} E\{r + \gamma^k V^*_{\mathcal{O}}(s') \mid \mathcal{E}(o, s)\}, \tag{7}$$

where $\mathcal{E}(o, s)$ denotes the event of starting the execution of option $o$ in state $s$, $k$ is the random number of steps elapsing during $o$, $s'$ is the resulting next state, and $r$ is the cumulative discounted reward received along the way. The optimal option values are defined as:

$$Q^*_{\mathcal{O}}(s, o) = \max_{\mu \in \Pi(\mathcal{O})} Q^\mu(s, o) \tag{8}$$

$$\phantom{Q^*_{\mathcal{O}}(s, o)} = E\{r + \gamma^k \max_{o' \in \mathcal{O}_{s'}} Q^*_{\mathcal{O}}(s', o') \mid \mathcal{E}(o, s)\}. \tag{9}$$

Given a set of options, $\mathcal{O}$, a corresponding optimal policy, denoted $\mu^*_{\mathcal{O}}$, is any policy that achieves $V^*_{\mathcal{O}}$, i.e., for which $V^{\mu^*_{\mathcal{O}}}(s) = V^*_{\mathcal{O}}(s)$ for all states $s \in \mathcal{S}$. If $V^*_{\mathcal{O}}$ and models of the options are known, then optimal policies can be formed by choosing in any proportion among the maximizing options in (7). Or, if $Q^*_{\mathcal{O}}$ is known, then optimal policies can be formed by choosing in each state $s$ in any proportion among the options $o$ for which $Q^*_{\mathcal{O}}(s, o) = \max_{o'} Q^*_{\mathcal{O}}(s, o')$. Thus, computing approximations to $V^*_{\mathcal{O}}$ or $Q^*_{\mathcal{O}}$ become the primary goals of planning and learning methods with options.
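To make the multi-time model of equations (3) and (4) concrete, consider a hypothetical option (invented here for illustration, not one of the options used in the experiments) that, when started in state $s$, deterministically runs for exactly $k = 3$ steps, receives a reward of $-1$ on each step, and terminates in a state $s'$. With $\gamma = 0.9$, the two parts of its model work out as follows:

```latex
\begin{aligned}
r^o_s &= -1 - \gamma - \gamma^2 = -(1 + 0.9 + 0.81) = -2.71, \\
p^o_{ss'} &= \gamma^{3} = 0.729, \qquad p^o_{sx} = 0 \ \text{ for all } x \neq s'.
\end{aligned}
```

Note that $p^o_{ss'}$ is a discounted quantity rather than a plain probability: even though the option reaches $s'$ with certainty, the entry is $0.729$ rather than 1, and it would shrink further for a longer-lasting option.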

The problem of finding the optimal value functions for a set of options can be addressed by learning methods. Because an MDP augmented by options forms an SMDP, we can apply SMDP learning methods as developed by Bradtke and Duff (1995), Parr and Russell (1998), Parr (in preparation), Mahadevan et al. (1997), and McGovern, Sutton and Fagg (1997). In these methods, each option is viewed as an indivisible, opaque unit. After the execution of option $o$ is started in state $s$, we next jump to the state $s'$ in which it terminates. Based on this experience, an estimate $Q(s, o)$ of the optimal option-value function is updated. For example, the SMDP version of one-step Q-learning (Bradtke and Duff, 1995), which we call one-step SMDP Q-learning, updates after each option termination by

$$Q(s, o) \leftarrow Q(s, o) + \alpha \Big[ r + \gamma^k \max_{o' \in \mathcal{O}} Q(s', o') - Q(s, o) \Big],$$

where $k$ is the number of time steps elapsing between $s$ and $s'$, $r$ is the cumulative discounted reward over this time, and it is implicit that the step-size parameter $\alpha$ may depend arbitrarily on the states, option, and time steps. The estimate $Q(s, o)$ converges to $Q^*_{\mathcal{O}}(s, o)$ for all $s \in \mathcal{S}$ and $o \in \mathcal{O}$ under conditions similar to those for conventional Q-learning (Parr, in preparation).
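The one-step SMDP Q-learning update can be written as a short function. This is a minimal sketch under stated assumptions: the table `Q` is a dictionary keyed by (state, option) pairs, the caller supplies the options available in the terminating state, and the state and option names in the usage lines are hypothetical.

```python
from collections import defaultdict

def smdp_q_update(Q, s, o, r, k, s_next, options_next, alpha, gamma):
    """Apply one SMDP Q-learning update after option `o` terminates.

    r is the cumulative discounted reward collected while `o` ran, k is the
    number of elapsed time steps, and options_next lists the options that
    can be chosen in the terminating state s_next.
    """
    target = r + gamma ** k * max(Q[(s_next, o2)] for o2 in options_next)
    Q[(s, o)] += alpha * (target - Q[(s, o)])
    return Q[(s, o)]

# Usage with made-up numbers: the option ran for 3 steps and ended in "B".
Q = defaultdict(float)
Q[("B", "x")] = 2.0
new_value = smdp_q_update(Q, "A", "go", r=1.0, k=3, s_next="B",
                          options_next=["x", "y"], alpha=0.5, gamma=0.9)
```

Only the entry for the executed option is touched, and only once the option has run to termination, which is the limitation the following section addresses.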


5 Intra-Option Value Learning

One drawback to SMDP learning methods is that they need to execute an option to termination before they can learn about it. Because of this, they can only be applied to one option at a time: the option that is executing at that time. More interesting and potentially more powerful methods are possible by taking advantage of the structure inside each option. In particular, if the options are Markov and we are willing to look inside them, then we can use special temporal-difference methods to learn usefully about an option before the option terminates. This is the main idea behind intra-option methods.

Intra-option methods are examples of off-policy learning methods (Sutton and Barto, 1998) in that they learn about the consequences of one policy while actually behaving according to another, potentially different policy. Intra-option methods can be used to learn simultaneously about many different options from the same experience. Moreover, they can learn about the values of executing options without ever executing those options.

Intra-option methods for value learning are potentially more efficient than SMDP methods because they extract more training examples from the same experience. For example, suppose we are learning to approximate $Q^*_{\mathcal{O}}(s, o)$ and that $o$ is Markov. Based on an execution of $o$ from $t$ to $t + k$, SMDP methods extract a single training example for $Q^*_{\mathcal{O}}(s, o)$. But because $o$ is Markov, it is, in a sense, also initiated at each of the steps between $t$ and $t + k$. The jumps from each intermediate $s_{t+j}$ to $s_{t+k}$ are also valid experiences with $o$, experiences that can be used to improve estimates of $Q^*_{\mathcal{O}}(s_{t+j}, o)$. Or consider an option that is very similar to $o$ and which would have selected the same actions, but which would have terminated one step later, at $t + k + 1$ rather than at $t + k$. Formally this is a different option, and formally it was not executed, yet all this experience could be used for learning relevant to it. In fact, an option can often learn something from experience that is only slightly related (occasionally selecting the same actions) to what would be generated by executing the option. This is the idea of off-policy training: to make full use of whatever experience occurs in order to learn as much as possible about all options, irrespective of their role in generating the experience. To make the best use of experience we would like an off-policy and intra-option version of Q-learning.

It is convenient to introduce new notation for the value of a state-option pair given that the option is Markov and executing upon arrival in the state:

$$U^*_{\mathcal{O}}(s, o) = (1 - \beta(s)) Q^*_{\mathcal{O}}(s, o) + \beta(s) \max_{o' \in \mathcal{O}} Q^*_{\mathcal{O}}(s, o').$$

Then we can write Bellman-like equations that relate $Q^*_{\mathcal{O}}(s, o)$ to expected values of $U^*_{\mathcal{O}}(s', o)$, where $s'$ is the immediate successor to $s$ after initiating Markov option $o = (\mathcal{I}, \pi, \beta)$ in $s$:

$$Q^*_{\mathcal{O}}(s, o) = \sum_{a \in \mathcal{A}_s} \pi(s, a) E\{r + \gamma U^*_{\mathcal{O}}(s', o) \mid s, a\} = \sum_{a \in \mathcal{A}_s} \pi(s, a) \Big[ r^a_s + \gamma \sum_{s'} p^a_{ss'} U^*_{\mathcal{O}}(s', o) \Big],$$


where $r$ is the immediate reward upon arrival in $s'$. Now consider learning methods based on this Bellman equation. Suppose action $a_t$ is taken in state $s_t$ to produce next state $s_{t+1}$ and reward $r_{t+1}$, and that $a_t$ was selected in a way consistent with the Markov policy $\pi$ of an option $o = (\mathcal{I}, \pi, \beta)$. That is, suppose that $a_t$ was selected according to the distribution $\pi(s_t, \cdot)$. Then the Bellman equation above suggests applying the off-policy one-step temporal-difference update:

$$Q(s_t, o) \leftarrow Q(s_t, o) + \alpha \big[ (r_{t+1} + \gamma U(s_{t+1}, o)) - Q(s_t, o) \big],$$

where

$$U(s, o) = (1 - \beta(s)) Q(s, o) + \beta(s) \max_{o' \in \mathcal{O}} Q(s, o').$$

The method we call one-step intra-option Q-learning applies this update rule to every option $o$ consistent with every action taken, $a_t$.
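The update rule can be sketched for the case of deterministic Markov options as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: options are stored in a hypothetical dictionary with keys `input_set`, `policy` (a state-to-action mapping) and `beta` (a function of state), updates are applied sequentially in dictionary order, and a state with no available options is treated as terminal with value 0.

```python
from collections import defaultdict

def intra_option_q_update(Q, options, s, a, r, s_next, alpha, gamma):
    """Update every option whose deterministic policy picks `a` in `s`.

    Returns the names of the options that were updated.
    """
    available_next = [name for name, opt in options.items()
                      if s_next in opt["input_set"]]
    updated = []
    for name, opt in options.items():
        if s not in opt["input_set"] or opt["policy"].get(s) != a:
            continue  # this option is not consistent with the action taken
        beta = opt["beta"](s_next)
        best_next = max((Q[(s_next, o2)] for o2 in available_next), default=0.0)
        # U(s', o): continue with o w.p. 1 - beta, else pick the best option.
        u = (1.0 - beta) * Q[(s_next, name)] + beta * best_next
        Q[(s, name)] += alpha * (r + gamma * u - Q[(s, name)])
        updated.append(name)
    return updated

# Hypothetical corridor 0-1-2 (state 2 terminal) with two one-step options
# and one multi-step option that keeps moving right until state 2.
options = {
    "right": {"input_set": {0, 1}, "policy": {0: "R", 1: "R"},
              "beta": lambda s: 1.0},
    "left": {"input_set": {0, 1}, "policy": {0: "L", 1: "L"},
             "beta": lambda s: 1.0},
    "to_end": {"input_set": {0, 1}, "policy": {0: "R", 1: "R"},
               "beta": lambda s: 1.0 if s == 2 else 0.0},
}
Q = defaultdict(float)
Q[(1, "right")], Q[(1, "left")], Q[(1, "to_end")] = 3.0, 4.0, 2.0
updated = intra_option_q_update(Q, options, s=0, a="R", r=-1.0, s_next=1,
                                alpha=0.5, gamma=0.9)
```

A single primitive transition here updates both `right` and `to_end`, even though `to_end` was never explicitly selected; the one-step option bootstraps from the best option in the next state, while the multi-step option bootstraps from its own continuation value.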

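Theorem 1 (Convergence of intra-option Q-learning)

For any set of deterministic Markov options $\mathcal{O}$, one-step intra-option Q-learning converges w.p.1 to the optimal Q-values, $Q^*_{\mathcal{O}}$, for every option, regardless of what options are executed during learning, provided every primitive action gets executed in every state infinitely often.

Proof: (Sketch) On experiencing $(s, a, r, s')$, for every option $o$ that picks action $a$ in state $s$, intra-option Q-learning performs the following update:

$$Q(s, o) \leftarrow Q(s, o) + \alpha(s, o) \big[ r + \gamma U(s', o) - Q(s, o) \big].$$

Let $a$ be the action selected by deterministic Markov option $o = (\mathcal{I}, \pi, \beta)$. Our result follows directly from Theorem 1 of Jaakkola et al. (1994) and the observation that the expected value of the update operator $r + \gamma U(s', o)$ yields a contraction, as shown below:

$$\begin{aligned}
\big| E\{r + \gamma U(s', o)\} - Q^*_{\mathcal{O}}(s, o) \big|
&= \Big| r^a_s + \gamma \sum_{s'} p^a_{ss'} U(s', o) - r^a_s - \gamma \sum_{s'} p^a_{ss'} U^*_{\mathcal{O}}(s', o) \Big| \\
&\le \gamma \Big| \sum_{s'} p^a_{ss'} \Big[ (1 - \beta(s')) \big( Q(s', o) - Q^*_{\mathcal{O}}(s', o) \big) + \beta(s') \big( \max_{o' \in \mathcal{O}} Q(s', o') - \max_{o' \in \mathcal{O}} Q^*_{\mathcal{O}}(s', o') \big) \Big] \Big| \\
&\le \gamma \sum_{s'} p^a_{ss'} \max_{s'', o''} \big| Q(s'', o'') - Q^*_{\mathcal{O}}(s'', o'') \big| \\
&= \gamma \max_{s'', o''} \big| Q(s'', o'') - Q^*_{\mathcal{O}}(s'', o'') \big|. \qquad \Box
\end{aligned}$$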

6 Illustrations of Intra-Option Value Learning

As an illustration of intra-option value-learning, we used the gridworld environment shown in Figure 1. The cells of the grid correspond to the states of the environment. From any state the agent can perform one of four actions, up, down, left or right, which have a stochastic effect. With probability 2/3, the actions cause the agent to move one cell in the corresponding direction, and with probability 1/3, the agent moves instead in one of the other three directions, each with 1/9 probability. If the movement would take the agent into a wall, then the agent remains in the same cell. There are small negative rewards for each action, with means uniformly distributed between 0 and -1. The rewards are also perturbed by Gaussian noise with standard deviation 0.1. The environment also has a goal state, labeled “G”. A complete trip from a random start state to the goal state is called an episode. When the agent enters “G”, it gets a reward of 1 and the episode ends. In all the experiments the discount parameter was $\gamma = 0.9$ and all the initial value estimates were 0.

Figure 1: The rooms example is a gridworld environment with stochastic cell-to-cell actions and room-to-room hallway options. Two of the hallway options are suggested by the arrows labeled $o_1$ and $o_2$. The label G indicates the location used as a goal.
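The stochastic effect of the four movement actions can be sketched as below. This is a minimal illustration of the transition rule only; the (row, column) cell encoding, the representation of walls as a set of blocked cells, and the function names are assumptions made for the example, and rewards are left out.

```python
import random

# Direction name -> (row delta, column delta); the grid layout is hypothetical.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def outcome_distribution(action):
    """Intended direction with probability 2/3, each other direction 1/9."""
    return {d: (2 / 3 if d == action else 1 / 9) for d in MOVES}

def step(cell, action, walls, rng):
    """Sample the next cell; moving into a wall leaves the agent in place."""
    dist = outcome_distribution(action)
    direction = rng.choices(list(dist), weights=list(dist.values()))[0]
    d_row, d_col = MOVES[direction]
    nxt = (cell[0] + d_row, cell[1] + d_col)
    return cell if nxt in walls else nxt
```

For example, `step((1, 1), "up", walls=set(), rng=random.Random(0))` returns one of the four neighbouring cells, most often the one above.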

In  each  of  the  four  rooms  we  provide  two  built-in  hallway 
options  designed  to  take  the  agent  from  anywhere  within 
the  room  to  one  of  the  two  hallway  cells  leading  out  of 
the  room.  The  policies  underlying  the  options  follow  the 
shortest  expected  path  to  the  hallway. 

For the first experiment, we applied the intra-option method in this environment without selecting the hallway options. In each episode, the agent started at a random state in the environment and thereafter selected primitive actions randomly, with equal probability. On every transition, the update (5) was applied first to the primitive action taken, then to any of the hallway options that were consistent with it. The hallway options were updated in clockwise order, starting from any hallways that faced up from the current state. The value of the step-size parameter was $\alpha = 0.01$.

This is a case in which SMDP methods would not be able to learn anything about the hallway options, because these options are never executed. However, the intra-option method learned the values of these actions effectively, as shown in Figure 2. The upper panel shows the value of the greedy policy learned by the intra-option method, averaged over $\mathcal{I}$ and over 30 repetitions of the whole experiment. The lower panel shows the correct and learned values for the two hallway options that apply in the state marked * in Figure 1. Similar convergence to the true values was observed for all the other states and options.

Figure 2: The learning of option values by intra-option methods without ever selecting the options. The value of the greedy policy goes to the optimal value (upper panel) as the learned values approach the correct values (as shown for one state, in the lower panel).

So far we have illustrated the effectiveness of intra-option
learning in a context in which SMDP methods do not apply.
How do intra-option methods compare to SMDP methods
when both are applicable? In order to investigate this
question, we used the same environment, but now we allowed
the agent to choose among the hallway options as
well as the primitive actions, which were treated as one-step
options. In this case, SMDP methods can be applied,
since all the options are actually executed. We experimented
with two SMDP methods: one-step SMDP Q-learning
(Bradtke and Duff, 1995) and a hierarchical form
of Q-learning called macro Q-learning (McGovern, Sutton
and Fagg, 1997). The difference between the two methods
is that, when taking a multi-step option, SMDP Q-learning
only updates the value of that option, whereas macro Q-learning
also updates the values of the one-step options (actions)
that were taken along the way.

Figure 3: Comparison of SMDP, intra-option and macro Q-learning.
Intra-option methods converge faster to the correct values.
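
The difference between the two SMDP methods can be made concrete with a short sketch. The Python below is illustrative only: the tabular `Q` dictionary, the `steps` trajectory format and the `options_at` helper are assumptions introduced here, not details of the original experiments. It assumes the usual SMDP Q-learning backup, in which an option that ran for k steps is backed up with its discounted return plus $\gamma^k$ times the best value at the state where it terminated.

```python
from collections import defaultdict

GAMMA = 0.9  # discount parameter used in the experiments


def smdp_q_update(Q, s, o, rewards, s_end, options_end, alpha):
    """Back up option o, which ran from s to s_end and collected `rewards`."""
    k = len(rewards)
    ret = sum(GAMMA ** i * r for i, r in enumerate(rewards))
    best_next = max(Q[(s_end, o2)] for o2 in options_end)
    Q[(s, o)] += alpha * (ret + GAMMA ** k * best_next - Q[(s, o)])


def macro_q_update(Q, o, steps, options_at, alpha):
    """steps: list of (state, action, reward, next_state) taken while o ran.

    options_at is a hypothetical helper returning the options available
    in a given state.
    """
    # Macro Q-learning also updates every one-step option (primitive action)
    # taken along the way.
    for s, a, r, s2 in steps:
        smdp_q_update(Q, s, a, [r], s2, options_at(s2), alpha)
    # The multi-step option itself is updated exactly as in SMDP Q-learning.
    s0, s_end = steps[0][0], steps[-1][3]
    smdp_q_update(Q, s0, o, [r for _, _, r, _ in steps], s_end,
                  options_at(s_end), alpha)
```

With SMDP Q-learning alone, only the entry for the executed option changes; with the macro variant, the primitive actions taken inside the option receive ordinary one-step updates as well.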


In this experiment, options were selected not at random, but
in an $\epsilon$-greedy way dependent on the current option-value
estimates. That is, given the current estimates $Q(s, o)$, let
$o^* = \arg\max_{o \in \mathcal{O}_s} Q(s, o)$ denote the best valued action
(with ties broken randomly). Then the policy used to select
options was

$$\mu(s, o) = \begin{cases} 1 - \epsilon + \epsilon / |\mathcal{O}_s| & \text{if } o = o^* \\ \epsilon / |\mathcal{O}_s| & \text{otherwise,} \end{cases}$$

for all $s \in \mathcal{S}$ and $o \in \mathcal{O}$. The probability of a random
action, $\epsilon$, was set at 0.1 in all cases. For each algorithm,
we tried several step-size values $\alpha$ and then
picked the best one.
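
As a minimal sketch of $\epsilon$-greedy option selection, the following Python computes the selection probabilities for one state. It assumes the common form in which, with probability $\epsilon$, the choice is uniform over all available options (including the greedy one); the function names and the dictionary representation of the estimates are hypothetical.

```python
import random


def epsilon_greedy_probs(q_values, epsilon=0.1):
    """Return the selection probability of each option in one state.

    q_values maps an option name to its current estimate Q(s, o).
    Ties for the best value are broken randomly.
    """
    best_value = max(q_values.values())
    best = random.choice([o for o, q in q_values.items() if q == best_value])
    n = len(q_values)
    probs = {o: epsilon / n for o in q_values}
    probs[best] += 1.0 - epsilon
    return probs


def select_option(q_values, epsilon=0.1):
    """Sample one option according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    options = list(probs)
    return random.choices(options, weights=[probs[o] for o in options])[0]
```

With three options and $\epsilon = 0.1$, the greedy option is selected with probability $0.9 + 0.1/3$ and each of the others with probability $0.1/3$.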

Figure 3 shows two measures of the performance of the
learning algorithms. The upper panel shows the average
absolute error in the estimates of $Q^*_{\mathcal{O}}$ for the hallway options,
averaged over the input sets $\mathcal{I}$, the eight hallway
options, and 30 repetitions of the whole experiment. The
intra-option method showed significantly faster learning
than either of the SMDP methods. The lower panel shows the
quality of the policy executed by each method, measured
as the average reward over the state space. The intra-option
method was also the fastest to learn by this measure.


7  Intra-Option  Model  Learning 

In this section, we consider intra-option methods for learning
multi-time models of options, $r^o_s$ and $p^o_{ss'}$, given knowledge
of the option (i.e., of its $\pi$, $\beta$, and $\mathcal{I}$). Such models are
used in planning methods (e.g., Precup, Sutton, and Singh,
1997, 1998a,b).


The most straightforward approach to learning the model
of an option is to execute the option to termination many
times in each state $s$, recording the resultant next states
$s'$, cumulative discounted rewards $r$, and elapsed times $k$.
These outcomes can then be averaged to approximate the
expected values for $r^o_s$ and $p^o_{ss'}$ given by (3) and (4). For
example, an incremental learning rule for this could update
its estimates $\hat{r}^o_s$ and $\hat{p}^o_{sx}$, for all $x \in \mathcal{S}$, after each execution
of $o$ in state $s$, by

$$\hat{r}^o_s \leftarrow \hat{r}^o_s + \alpha \left[ r - \hat{r}^o_s \right], \tag{10}$$

and

$$\hat{p}^o_{sx} \leftarrow \hat{p}^o_{sx} + \alpha \left[ \gamma^k \delta_{xs'} - \hat{p}^o_{sx} \right], \tag{11}$$

where the step-size parameter, $\alpha$, may be constant or may
depend on the state, option, and time. For example, if $\alpha$ is 1
divided by the number of times that $o$ has been experienced
in $s$, then these updates maintain the estimates as sample
averages of the experienced outcomes. However the averaging
is done, we call these SMDP model-learning methods
because, like SMDP value-learning methods, they are
based on jumping from initiation to termination of each option,
ignoring what might happen along the way. In the special
case in which $o$ is a primitive action, note that SMDP
model-learning methods reduce exactly to those used to
learn conventional one-step models of actions.

Now let us consider intra-option methods for model learning.
The idea is to use Bellman equations for the model,
just as we used the Bellman equations in the case of learning
value functions. The correct model of a Markov option
$o = (\mathcal{I}, \pi, \beta)$ is related to itself by

$$r^o_s = \sum_{a \in \mathcal{A}_s} \pi(s, a) \, E\left\{ r + \gamma (1 - \beta(s')) r^o_{s'} \right\}, \tag{12}$$

where $r$ and $s'$ are the reward and next state given that action
$a$ is taken in state $s$, and

$$p^o_{sx} = \sum_{a \in \mathcal{A}_s} \pi(s, a) \, \gamma E\left\{ (1 - \beta(s')) p^o_{s'x} + \beta(s') \delta_{s'x} \right\}, \tag{13}$$

for all $s, x \in \mathcal{S}$. How can we turn these Bellman equations
into update rules for learning the model? First consider that
action $a_t$ is taken in $s_t$ and that the way it was selected is
consistent with $o = (\mathcal{I}, \pi, \beta)$, that is, that $a_t$ was selected
with the distribution $\pi(s_t, \cdot)$. Then the Bellman equations
above suggest the temporal-difference update rules

$$\hat{r}^o_{s_t} \leftarrow \hat{r}^o_{s_t} + \alpha \left[ r_{t+1} + \gamma (1 - \beta(s_{t+1})) \hat{r}^o_{s_{t+1}} - \hat{r}^o_{s_t} \right] \tag{14}$$

and

$$\hat{p}^o_{s_t x} \leftarrow \hat{p}^o_{s_t x} + \alpha \left[ \gamma (1 - \beta(s_{t+1})) \hat{p}^o_{s_{t+1} x} + \gamma \beta(s_{t+1}) \delta_{s_{t+1} x} - \hat{p}^o_{s_t x} \right], \tag{15}$$

where $\hat{p}^o_{ss'}$ and $\hat{r}^o_s$ are the estimates of $p^o_{ss'}$ and $r^o_s$,
respectively, and $\alpha$ is a positive step-size parameter. The
method we call one-step intra-option model learning applies
these updates to every option consistent with every
action taken. Of course, this is just the simplest intra-option
model-learning method. Others may be possible using eligibility
traces and standard tricks for off-policy learning
(see Sutton, 1995; Sutton and Barto, 1998).
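
The two kinds of model update can be contrasted in a short sketch. The Python below is a minimal illustration under stated assumptions: the model of a single option is stored in dictionaries `r_hat` (indexed by state) and `p_hat` (indexed by state pairs), `beta` is a function giving the option's termination probability, and all function and argument names are hypothetical. The SMDP update is applied once per completed execution of the option, whereas the intra-option update is applied on every step whose action is consistent with the option.

```python
from collections import defaultdict

GAMMA = 0.9  # discount parameter


def smdp_model_update(r_hat, p_hat, s, s_end, ret, k, states, alpha):
    """Updates (10)-(11): the option ran from s and terminated in s_end.

    ret is the cumulative discounted reward, k the elapsed number of steps.
    """
    r_hat[s] += alpha * (ret - r_hat[s])
    for x in states:
        delta = 1.0 if x == s_end else 0.0
        p_hat[(s, x)] += alpha * (GAMMA ** k * delta - p_hat[(s, x)])


def intra_option_model_update(r_hat, p_hat, beta, s, r, s_next, states, alpha):
    """Updates (14)-(15): one consistent transition (s, r, s_next)."""
    cont = 1.0 - beta(s_next)  # probability that the option continues
    r_target = r + GAMMA * cont * r_hat[s_next]
    r_hat[s] += alpha * (r_target - r_hat[s])
    for x in states:
        delta = 1.0 if x == s_next else 0.0
        p_target = (GAMMA * cont * p_hat[(s_next, x)]
                    + GAMMA * beta(s_next) * delta)
        p_hat[(s, x)] += alpha * (p_target - p_hat[(s, x)])
```

On a deterministic chain 0, 1, 2 in which the option terminates in state 2 and the only reward is 1 on entering state 2, both methods give a reward prediction of $\gamma$ and a transition prediction of $\gamma^2$ for state 0, but the intra-option method reaches them from single-step transitions without ever executing the option from start to finish.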

Intra-option methods for model learning have advantages
over SMDP methods similar to those we saw earlier for
value-learning methods. As an illustration, consider the application
of SMDP and intra-option model-learning methods
to the rooms example. We assume that the eight hallway
options are given as before, but now we assume that
their models are not given and must be learned. Experience
is generated by selecting randomly in each state among the
two possible options and four possible actions, with no goal
state. In the SMDP model-learning method, equations (10)
and (11) were applied whenever an option was selected,
whereas, in the intra-option model-learning method, equations
(14) and (15) were applied on every step to all options
that were consistent with the action taken on that step. In
this example, all options are deterministic, so consistency
with the action selected means simply that the option would
have selected that action.

Figure 4: Learning curves for model learning by SMDP
and intra-option methods.

For the SMDP method, the step-size parameter was varied
so that the model estimates were sample averages, which
should give the fastest learning. The results of this method
are labeled “SMDP 1/t” on the graphs. We also looked
at results using a fixed learning rate. In this case and
for the intra-option method we tried four step-size values
$\alpha$ and picked the best value for each
method. Figure 4 shows the learning curves for all three
methods, using the best $\alpha$ value wherever a fixed $\alpha$ was
used. The upper panel shows the average and maximum absolute
error in the reward predictions, and the lower panel
shows the average and maximum absolute
error in the transition predictions, averaged over the
eight options and over 30 independent runs. The intra-option
method approached the correct values more rapidly
than the SMDP methods.


8  Closing 

The theoretical and empirical results presented in this paper
suggest that intra-option methods provide an efficient
way to take advantage of the structure inside an option.
Intra-option methods use experience with a single action
to update the value or model for all the options that are
consistent with that action. In this way they make much
more efficient use of the experience than SMDP methods,
which treat options as indivisible units. In the future, we
plan to extend these algorithms to the case of non-Markov
options, and to combine them with eligibility traces.
Abstract

This paper presents an innovative application of
the Disciple Learning Agent Shell to the
building of an educational agent that generates
history tests for middle school students, to assist
in the assessment of their understanding and use
of higher-order thinking skills. Disciple has been
taught by an educator to generate and answer
basic test questions and to explain the answers.
From its interaction with the educational expert,
Disciple has learned general rules that allow it to
generate a large number of new test questions for
students, together with hints, answers, and explanations
of the answers. As a result, it can guide
the students during their practice of higher-order
thinking skills as they would be directly guided
by the educator. It can also be used by the educator
to generate a different exam for each student
in the class. Disciple has been experimentally
evaluated by history experts, students and teachers,
with very promising results. The work on
developing this educational agent illustrates an
integration of machine learning, knowledge
acquisition, problem solving and intelligent tutoring
systems in the context of computer-based
assessment involving multimedia documents.

1  INTRODUCTION 

For  several  years  we  have  been  developing  the  Disciple 
approach  for  building  intelligent  agents.  The  defining 
feature  of  the  Disciple  approach  to  building  agents  is  that 
a  person  teaches  the  agent  how  to  perform  domain-specific 
tasks,  by  giving  the  agent  examples  and  explanations,  as 
well  as  supervising  and  correcting  its  behavior.  The 
current  version  of  the  Disciple  approach  is  implemented  in 
the  Disciple  Learning  Agent  Shell,  and  is  presented  in 
(Tecuci,  1998).  We  define  a  learning  agent  shell  as 
consisting  of  a  learning  engine  and  an  inference  engine 
that support a representation formalism in which a
knowledge  base  can  be  encoded,  as  well  as  a  methodology 
for  building  the  knowledge  base. 

The central goal of the Disciple approach is to facilitate
the agent building process by the use of synergism at
three different levels. First, there is synergism between
different learning methods employed by the agent
(Michalski and Tecuci, 1994). By integrating
complementary learning methods (such as inductive
learning from examples, explanation-based learning,
learning by analogy, learning by experimentation), the
Disciple agent is able to learn from the human expert in
situations in which no single-strategy learning method
would be sufficient. Second, there is synergism between
the expert’s teaching of the agent and the agent’s learning
from the expert (Tecuci and Kodratoff, 1995). For
instance, the expert may select representative examples to
teach the agent, may provide explanations, and may
answer the agent’s questions. The agent, on the other hand,
will learn general rules that are difficult for the expert
to define, and will consistently integrate them into its
knowledge base. Third, there is synergism between the
expert and the agent in solving a problem. They form a
team in which the agent solves the more routine but
labor-intensive parts of the problem and the expert solves the
more creative parts. In the process, the agent learns from
the expert, gradually evolving toward an "intelligent"
agent (Mitchell et al., 1985). We claim that the Disciple
approach significantly reduces the involvement of the
knowledge engineer in the process of building an
intelligent agent, most of the work being done directly by
the domain expert. In this respect, the work on Disciple is
part of a long-term vision where personal computer users
will no longer be simply consumers of ready-made
software, as they are today, but also developers of their
own software assistants.

This  paper  is  organized  as  follows.  The  next  section 
presents  the  developed  test  generation  agent.  Then, 
sections  3,  4  and  5  describe  the  process  of  building  the 
agent.  Section  6  describes  the  results  of  the  experiments 
performed  with  the  developed  agent.  Finally,  the  paper 
presents  the  conclusions  of  this  work. 

2  A  TEST  GENERATION  AGENT 

We have developed an agent that generates history tests to
assist in the assessment of students’ understanding and
use of higher-order thinking skills. Examples of specific
higher-order thinking skills are: evaluation of historical
sources for relevance, credibility, consistency, ambiguity,
bias, and fact vs. opinion; analyzing them for content,
meaning and point of view; and synthesizing arguments in
the form of conclusions, claims and assertions (Bloom,
1956; Beyer, 1987, 1988).

To motivate the middle school students, for whom this
agent was developed, and to provide an element of game
playing, the agent employs a journalist metaphor, asking
the students to assume the role of a novice journalist.
Figure 1, for instance, shows a test question generated by
the agent. The student is asked to imagine that he or she
is a reporter and has been assigned the task of writing an
article for Christian Recorder during the Civil War period
on plantations. The student has to analyze the historical
source “Slave Quarters” in order to determine whether it
is relevant to this task. In the situation illustrated in
Figure 1, the student answered correctly. Therefore, the
agent confirmed the answer and provided an explanation
for it, as indicated in the lower right pane of the window.
The student could have requested a hint to answer the
question and would have received the following one: “To
determine if the source is relevant to your task investigate
if it illustrates some component of a plantation, check
when it was created and when Christian Recorder was
issued.” In general, there may be several reasons why a
source is relevant to a task. By pushing the More button,
the student can receive the hints and explanations
corresponding to these additional reasons.


Another example of a test question is shown in Figure 2.
The  student  is  given  a  task,  a  historical  source  and  three 
possible  reasons  why  the  source  is  relevant  to  the  task.  He 
or  she  has  to  investigate  the  source  and  decide  which 
reason(s)  account  for  the  fact  that  the  source  is  relevant  to 
the  task.  The  student  is  instructed  to  check  the  box  next 
to  the  correct  reason(s). 

The  agent  has  two  modes  of  operation:  final  exam  mode 
and  self-assessment  mode.  In  the  final  exam  mode,  the 
agent  generates  an  exam  consisting  of  a  set  of  test 
questions  of  different  levels  of  difficulty.  The  student  has 
to  answer  one  test  question  at  a  time  and,  after  each 
question,  he  or  she  receives  the  correct  answer  and  an 
explanation  of  the  answer.  In  the  self-assessment  mode, 
the  student  chooses  the  type  of  test  question  to  solve,  and 
will  receive,  on  request,  feedback  in  the  form  of  hints  to 
answer  the  question,  the  correct  answer,  and  some  or  all 
the explanations of the answer. The test questions are
generated such that all students interacting with the agent
are likely to receive different tests even if they follow
exactly the same interaction pattern. Moreover, the agent
builds and maintains a simple student model and uses it
in the process of test generation. For instance, to the
extent possible, the agent tries to generate test questions
that involve historical sources that have not been
investigated  by  the  student,  or  historical  sources  that  were 
not  used  in  previous  tests  for  that  student. 
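
One simple way to use such a student model when choosing a source for the next question is sketched below. This is not Disciple's actual mechanism: the `student_model` dictionary with its `investigated` and `tested` sets, and the policy of preferring sources that appear in neither set, are assumptions made for illustration.

```python
import random


def pick_source(candidate_sources, student_model, rng=random):
    """Choose a source, preferring ones the student has not yet seen.

    student_model is a hypothetical dict with two sets of source names:
    "investigated" (sources the student has examined) and "tested"
    (sources already used in that student's previous tests).
    """
    unseen = [s for s in candidate_sources
              if s not in student_model["investigated"]
              and s not in student_model["tested"]]
    # Fall back to any candidate when every source has already been seen.
    pool = unseen or list(candidate_sources)
    choice = rng.choice(pool)
    student_model["tested"].add(choice)
    return choice
```

Because the choice within the pool is random, two students with identical histories are still likely to receive different questions.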


Figure 1: A test question, answer and explanation generated by the agent

Figure 2: Another test question


The next sections present the process of building this
agent: building the agent’s ontology (Gruber, 1993),
teaching the agent how to generate test questions, and
building the test generation engine.

3  BUILDING  THE  AGENT’S  ONTOLOGY 

The agent’s ontology contains descriptions of historical
concepts (such as “plantation”), historical sources (such
as “Slave Quarters” in Figure 1), and templates for
reporter tasks (such as “You are a writer for PUBLICATION
during HISTORICAL-PERIOD and you have been assigned to
write and illustrate a feature article on SLAVERY-TOPIC.”).
Using these descriptions and templates, the agent
communicates with the students through a stylized natural
language, as illustrated in Figure 1 and Figure 2.
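
Filling such a task template can be sketched in a few lines of Python. The template string follows the example quoted above; the function name and the keyword names are hypothetical and do not reflect Disciple's internal representation.

```python
# Task template with one slot per ontology concept named in the text.
TASK_TEMPLATE = (
    "You are a writer for {publication} during {period} and you have been "
    "assigned to write and illustrate a feature article on {topic}."
)


def render_task(publication, period, topic):
    """Instantiate the reporter-task template with concrete descriptions."""
    return TASK_TEMPLATE.format(publication=publication, period=period,
                                topic=topic)


print(render_task("Southern Illustrated News", "the Civil War period",
                  "slave culture"))
```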

The ontology building process starts with choosing a
module in a history curriculum (such as Slavery in
America) for which the agent will generate test questions.
Then the educator identifies a set of historical concepts
that are appropriate and necessary to be learned by the
students. The educator also identifies a set of historical
sources that can enhance the student’s understanding of
these concepts and will be used in test questions. All
these concepts and historical sources are represented by
the history educator in the knowledge base, by using the
various interfaces of Disciple. One is the Source Viewer
that displays the historical sources. Another is the
Concept Editor that is used to describe the historical
sources. The historical sources have to be defined in terms
of features that are necessary for applying the higher-order
thinking skills of relevance, credibility, etc. For instance,
a source is relevant to some topic if it identifies,
illustrates or explains the topic or some of its
components. Let us consider the historical source
‘Contented Slaves and Masters’, from the bottom of
Figure 3. This source is defined as being a LITHOGRAPH
that ILLUSTRATES the concepts SLAVE-DANCE, MALE-SLAVE,
FEMALE-SLAVE, and SLAVE-MASTER. Other
information also has to be represented, such as the
audience for which this source is appropriate and when it
was created. The concepts from the knowledge base are
hierarchically organized in a semantic network (Quillian,
1968; Lenat and Guha, 1990) that can be inspected with
the Concept Browser. For instance, SLAVE-DANCE was
defined as being a type of SLAVE-RECREATION which, in
turn, was a SLAVE-LIFE-ASPECT. This initial knowledge
base of the agent was assumed to be incomplete and even
possibly partially incorrect, needing to be improved
during the next stages of the agent’s development.
4 TEACHING THE AGENT


4.1  RULE  LEARNING 


A  basic  relevancy  test  question  consists  of  judging  the 
relevancy  of  a  historical  source  to  a  given  reporter’s  task. 
To  teach  the  agent  to  generate  and  answer  such  questions, 
the  educator  gives  it  an  example  consisting  of  a  task  and  a 
historical  source  relevant  to  that  task,  as  shown  in  Figure 
3.  Starting  from  this  example,  the  agent  has  learned  the 
relevancy  rule  in  Figure  4,  where  the  condition  specifies  a 
general  reporter  task  and  the  conclusion  specifies  a  source 
relevant  to  that  task.  The  condition  also  incorporates  the 
explanation  of  why  the  source  is  relevant  to  the  task. 
Associated  with  the  rule  are  the  natural  language 
templates corresponding to its task, explanation and conclusion.
They are automatically created from the natural
language descriptions of the elements in the rule. One
should notice that each rule corresponds to a certain type
of task (WRITE-DURING-PERIOD, in this case). Other types
of tasks are WRITE-ON-TOPIC, WRITE-FOR-AUDIENCE, and
WRITE-FOR-OCCASION. Therefore, for each type of reporter
task  there  will  be  a  family  of  related  relevancy  rules.  The 
rules  corresponding  to  the  other  evaluation  criteria,  such 
as  credibility,  accuracy,  or  bias,  will  have  a  similar  form. 

Current Example

If you are a writer FOR Southern Illustrated News DURING the Civil War
period (1861 - 1865) and you have been assigned to write and illustrate
a feature article on slave culture, then the HISTORICAL SOURCE
'Contented Slaves and Masters' is relevant.

Figure 3: Initial example given by the educator


IF
    ?W1 IS WRITE-DURING-PERIOD, FOR ?S1, DURING ?P1, ON ?S2
    ?S1 IS PUBLICATION, ISSUED-DURING ?P1
    ?P1 IS HISTORICAL-PERIOD
    ?S2 IS SLAVERY-TOPIC
    ?S3 IS SOURCE, ILLUSTRATES ?S4, CREATED-DURING ?P2
    ?S4 IS HISTORICAL-CONCEPT, COMPONENT-OF ?S2
    ?P2 IS HISTORICAL-PERIOD, BEFORE ?P1
THEN
    RELEVANT HIST-SOURCE ?S3

Task Description: You are a writer for ?S1 during ?P1 and you have been
assigned to write and illustrate a feature article on ?S2

Explanation: ?S3 illustrates ?S4 which was a component of ?S2, ?S3 was
created during ?P2 which was before ?P1 and ?S1 was issued during ?P1.

Operation Description: ?S3 is relevant

Figure 4: A relevancy rule
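
To illustrate how a rule of this form could be matched against a knowledge base, the following Python sketch checks the conditions of the relevancy rule on a toy set of facts taken from the running example. The dictionaries and the function name are hypothetical stand-ins; Disciple's actual ontology and inference engine are not shown in this paper, and the sketch ignores concept hierarchies entirely.

```python
# Toy knowledge base; every entry is illustrative only.
ISSUED_DURING = {"Southern Illustrated News": "Civil War period"}
BEFORE = {("pre-Civil War period", "Civil War period")}
COMPONENT_OF = {"slave dance": "slave culture"}
SOURCES = {
    "Contented Slaves and Masters": {
        "illustrates": ["slave dance"],
        "created_during": "pre-Civil War period",
    },
}


def relevant_sources(publication, period, topic):
    """Return (source, explanation) pairs satisfying the rule's condition."""
    results = []
    # ?S1 IS PUBLICATION, ISSUED-DURING ?P1
    if ISSUED_DURING.get(publication) != period:
        return results
    for source, facts in SOURCES.items():
        # ?P2 IS HISTORICAL-PERIOD, BEFORE ?P1
        if (facts["created_during"], period) not in BEFORE:
            continue
        for concept in facts["illustrates"]:
            # ?S4 IS HISTORICAL-CONCEPT, COMPONENT-OF ?S2
            if COMPONENT_OF.get(concept) == topic:
                explanation = (
                    f"{source} illustrates {concept} which was a component "
                    f"of {topic}, {source} was created during "
                    f"{facts['created_during']} which was before {period} "
                    f"and {publication} was issued during {period}."
                )
                results.append((source, explanation))
    return results
```

Each returned explanation is produced from the same bindings that satisfied the condition, mirroring the way the rule's Explanation template reuses the variables of its IF part.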
The rule learning method of Disciple is schematically
represented in Figure 5. Like Explanation-based Learning
(DeJong and Mooney, 1986; Mitchell, Keller, and Kedar-Cabelli,
1986), it consists of two phases, explanation and
generalization. However, in the explanation phase the
agent is not building a proof tree, and the generalization is
not a deductive one.

Figure 5: The rule learning method of Disciple


In  the  explanation  phase,  the  educator  helps  the  agent  to 
understand  why  the  example  in  Figure  3  is  correct  (that 
is,  why  the  source  is  relevant  to  the  given  task).  The 
explanation  of  the  example  has  a  form  that  is  similar  to 
the  one  given  by  a  teacher  to  a  student:  the  source 
“Contented Slaves and Masters” is relevant to the given
task (see Figure 3) because it illustrates a slave dance
which was a component of slave culture, and it was
created during the pre-Civil War period which was before
the  Civil  War  period.  Each  of  these  phrases  corresponds 
to  a  path  in  the  agent’s  ontology,  as  shown  in  Figure  6. 
However,  rather  than  giving  an  explanation  to  the  agent, 
the  educator  guides  it  to  propose  explanations  and  then 
selects  the  correct  ones.  For  instance,  the  educator  may 
point  to  the  most  relevant  objects  from  the  input  example 
and  may  specify  the  types  of  explanations  to  be  generated 
by  the  agent  (e.g.  a  correlation  between  two  objects  or  a 
property  of  an  object).  The  agent  uses  such  guidance  and 
specific  heuristics  to  propose  plausible  explanations  to  the 
educator  who  has  to  select  the  correct  ones.  A  particularly 
useful  heuristic  is  to  propose  explanations  of  an  example 
by  analogy  with  the  explanations  of  other  examples. 
Notice that the above explanation is similar to a part of
the  explanation  from  the  test  question  in  Figure  1.  This 
illustrates  a  significant  benefit  to  be  derived  from  using 
the  Disciple  approach  to  build  educational  agents.  That 
is,  the  kind  of  explanations  that  the  agent  gives  to  the 
students  are  similar  to  the  explanations  that  the  agent 
itself  has  received  from  the  educator.  Therefore,  the  agent 
acts  as  an  indirect  communication  medium  between  the 
educator  and  the  students. 


In the generalization phase (see Figure 5), the agent
performs an analogy-based generalization of the example
and its explanation into a plausible version space (PVS)
rule. A PVS rule is an IF-THEN rule with two conditions, a
plausible upper bound condition that is likely to be more
general than the exact condition, and a plausible lower
bound condition that is likely to be less general than the
exact condition. The generalization process is illustrated
in Figure 7. The initial example is the internal representation
of the example in Figure 3. Also, the explanation is
the one from Figure 6. First, the explanation is generalized
to an analogy criterion by preserving the object
features (such as ILLUSTRATES and CREATED-DURING) and
by generalizing the objects to more general concepts (e.g.
generalizing SLAVE-DANCE to HISTORICAL-CONCEPT). To
determine how to generalize an object, Disciple analyzes
all the features from the example and the explanation that
are connected to that object. Each such feature is defined
in Disciple’s ontology by a domain (that specifies the set
of all the objects from the application domain that may
have that feature) and a range (that specifies all the
possible values of that feature). The domains and the
ranges of these features restrict the generalizations of the
objects. For instance, in the explanation from Figure 7,
SLAVE-DANCE has the feature COMPONENT-OF and appears
as value of the feature ILLUSTRATES. Therefore, the most
general generalization of SLAVE-DANCE is the intersection
of the domain of COMPONENT-OF and the range of
ILLUSTRATES, which is HISTORICAL-CONCEPT.

Figure 6: The explanation of the example in Figure 3
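
The intersection of feature domains and ranges can be sketched on a small tree-shaped hierarchy. In the Python below, the links from SLAVE-DANCE up to SLAVE-LIFE-ASPECT come from the text; the placement of HISTORICAL-CONCEPT above SLAVE-LIFE-ASPECT, the root concept SOMETHING, and the particular domain and range values passed in the usage example are assumptions made only for illustration.

```python
# child -> parent links of a toy concept hierarchy (a tree).
PARENT = {
    "HISTORICAL-CONCEPT": "SOMETHING",       # assumed root link
    "SLAVE-LIFE-ASPECT": "HISTORICAL-CONCEPT",  # assumed link
    "SLAVE-RECREATION": "SLAVE-LIFE-ASPECT",
    "SLAVE-DANCE": "SLAVE-RECREATION",
}


def ancestors_or_self(concept):
    """Return the chain from concept up to the root, most specific first."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def covers(general, specific):
    """True if `general` is `specific` itself or one of its ancestors."""
    return general in ancestors_or_self(specific)


def most_general_generalization(obj, constraints):
    """Most general concept above obj that lies inside every constraint.

    constraints are the domains and ranges of the features attached to obj.
    """
    candidates = [c for c in ancestors_or_self(obj)
                  if all(covers(k, c) for k in constraints)]
    if not candidates:
        raise ValueError("no generalization satisfies all constraints")
    return candidates[-1]  # the chain is ordered specific -> general


# Hypothetical domain of COMPONENT-OF and range of ILLUSTRATES.
print(most_general_generalization("SLAVE-DANCE",
                                  ["HISTORICAL-CONCEPT", "SOMETHING"]))
```

In a tree, the intersection of several covering concepts is simply the most specific of them, which is why the sketch returns HISTORICAL-CONCEPT here.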

The  analogy  criterion  and  the  example  are  used  to 
generate  the  plausible  upper  bound  condition  of  the  rule, 
while  the  explanation  and  the  example  are  used  to 
generate  the  plausible  lower  bound  condition  of  the  rule. 
Figure 7: Generation of initial plausible version space rule


4.2  RULE  REFINEMENT 

The representation of the PVS rule in the right hand side of
Figure 5 shows the most likely relation between the
plausible lower bound, the plausible upper bound and the
hypothetical exact condition of the rule. Notice that there
are instances of the plausible upper bound that are not
instances of the hypothetical exact condition of the rule.
This means that the learned rule in Figure 7 also covers
some negative examples. Also, there are instances of the
hypothetical exact condition that are not instances of the
plausible upper bound. This means that the plausible
upper bound does not cover all the positive examples of
the rule. Both of these situations are a consequence of the
fact that the explanation of the initial example might be
incomplete, and are consistent with what one would
expect from an agent performing analogical reasoning. To
improve this rule, the agent will use the rule refinement
method represented schematically in Figure 8. The agent
will use the learned rule to generate examples similar to
the one in Figure 3. Each such example is covered by the
plausible upper bound and is not covered by the plausible
lower bound of the rule. The example is shown to the
educator who is asked to accept it as correct or to reject it,
thus characterizing it as a positive or a negative example
of the rule. A correct example is used to generalize the
plausible lower bound of the rule’s condition through
empirical induction. An incorrect example is used to elicit
additional explanations from the educator and to specialize
both bounds, or only the upper bound.

Figure 9 shows an example generated by the agent, by
analogy with the initial example in Figure 3. The agent’s
analogical reasoning is represented in Figure 10. The
explanation on the left hand side indicates why the
initial example is correct. The expression on the right hand
side is similar to this explanation because both of them
are less general than the analogy criterion from the top of
Figure 10. Therefore, one may infer by analogy that the
similar explanation from the right hand side of Figure 10
explains an example (the generated example from the right
hand side of Figure 10 and from Figure 9) that is similar
to the initial example. Nevertheless, the generated
example is incorrect and was rejected by the educator.


Figure  8;  The  rule  refinement  method  of  Disciple 
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!1  Current  Example 


If  you  are  a  writer  FOR  World  Wide  Web 
DURING  the  Civil  War  period 
(1861  -  1865)  and  you  have  been 
assigned  to  write  and  illustrate  a  feature 
article  on  negative  master  slave 
relationships,  then  the  HISTORICAL 
SOURCE  'Fugitive  Slaves'  is  relevant. 


Figure 9: An example generated by the agent


In  such  a  case  the  agent  needed  to  understand  why  this 
example,  which  was  generated  by  analogy  with  a  correct 
example,  is  wrong.  By  comparing  the  two  examples,  the 
educator  and  the  agent  were  able  to  find  out  that  the 
generated example is wrong because WORLD-WIDE-WEB
was not issued during the CIVIL-WAR period. On the
contrary,  the  initial  example  was  correct  because 
SOUTHERN-ILLUSTRATED-NEWS  was  issued  during  the 
CIVIL-WAR  period.  This  explanation  is  used  to  specialize 
both  bounds  of  the  version  space.  This  process  will 
continue  until  either  the  two  bounds  of  the  rule  become 
identical  or  until  no  further  examples  can  be  generated  that 
are  not  already  covered  by  the  plausible  lower  bound.  The 
final  rule  is  the  one  from  Figure  4.  This  training  phase 
continued until 54 relevancy rules were learned.

Figure 10: Analogical reasoning in Disciple

5 THE TEST GENERATION ENGINE

One of the agent’s requirements was that it generate not
only test questions, but also feedback for right and wrong
answers, hints to help the student in solving the tests, as
well as explanations of the solutions. Moreover, the agent’s
messages needed to be expressed in a natural language
form. Although the rules learned by the agent contain
almost all the necessary information to achieve these
goals, some small adjustments were necessary. In the case
of the rule in Figure 4, the educator needed to define the
templates for the Hint, Right Answer and Wrong Answer,
shown in Figure 11. The Hint in Figure 11 is the part of
the Explanation in Figure 4 that refers only to the
variables used in the formulation of the test question. The
Right Answer in Figure 11 is generated from the
Operation Description and the Explanation in Figure 4,
and the Wrong Answer is a fixed text.

Hint: To determine if this source is relevant to your task investigate if it
illustrates some component of ?S2, check when was it created, and when
was issued

Right Answer: The source ?S3 is relevant to your task because it illustrates
?S4 which was a component of ?S2, ?S3 was created during ?P2 which was
before ?P1 and ?S1 was issued during ?P1.

Wrong Answer: Investigate this source further and analyze the hints and
explanations to improve your understanding of relevance. You may consider
reviewing the material on relevance. Then continue testing yourself.

Figure 11: Additional templates for the rule in Figure 4

The  learned  rules  can  be  used  to  generate  different  types  of 
tests.  In  the  current  version  of  the  agent  we  have  chosen  to 
develop  a  test  generation  engine  that  can  generate  the 
following  four  classes  of  test  questions: 

•IF  RELEVANT:  Show  the  student  a  writing  assignment 
and  ask  whether  a  particular  historical  source  is  relevant 
to  that  assignment; 

•WHICH  RELEVANT:  Show  the  student  a  writing 
assignment  and  three  historical  sources  and  ask  the 
student  to  identify  the  relevant  one; 

•WHICH  IRRELEVANT:  Show  the  student  a  writing 
assignment  and  three  historical  sources  and  ask  the 
student  to  identify  the  irrelevant  one;  and 

•WHY RELEVANT: Show the student a writing assignment,
a source and three possible reasons why the source
is relevant, and ask the student to select the right reason.

Similar questions could be generated for other evaluation
skills, such as IF CREDIBLE or WHY CREDIBLE test questions.

To generate an IF RELEVANT test question with a relevant
source, the agent simply needs to generate an example of a
relevancy rule. This rule example will contain a task T
and a source S relevant to it, together with one hint and
one explanation that will indicate one reason why S is
relevant to T. However, if the student requires all the
possible reasons why the source S is relevant to T,
then the agent will need to find all the examples
containing the source S and the task T of all the relevancy
rules from the family of rules corresponding to T.

To generate an IF RELEVANT test question with an
irrelevant source, the agent first has to generate a valid
task T by finding an example of a relevancy rule R. Then
it has to find a historical source S such that the task T
and the source S are not part of an example of any rule
from the family of rules corresponding to the task T.
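
The search for an irrelevant source described above can be sketched as follows. This is only a minimal illustration in Python, not the Disciple implementation: the function name `irrelevant_question`, the flat set of (task, source) pairs standing in for the examples of a family of relevancy rules, and the toy identifiers are all assumptions made for the sketch.

```python
import random


def irrelevant_question(rule_examples, sources, rng=random):
    """Pick a task T and a source S that no relevancy rule pairs with T.

    rule_examples: set of (task, source) pairs, a hypothetical flattened
        view of every example covered by the relevancy rules.
    sources: list of all historical source identifiers.
    Returns (task, source), or None if every source is relevant to the task.
    """
    # Step 1: obtain a valid task by drawing an example of a relevancy rule.
    task, _ = rng.choice(sorted(rule_examples))
    # Step 2: keep the sources that are not part of any example with this task.
    candidates = [s for s in sources if (task, s) not in rule_examples]
    if not candidates:
        return None
    # The choice among candidates is random, as in the paper's description.
    return task, rng.choice(candidates)


# Toy data (hypothetical identifiers).
examples = {("task-1", "src-a"), ("task-1", "src-b"), ("task-2", "src-a")}
all_sources = ["src-a", "src-b", "src-c"]
print(irrelevant_question(examples, all_sources, random.Random(0)))
```

Because both choices are random, repeated calls yield different questions, which matches the observation that the agent's behavior differs from one execution to another.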

The  methods  for  generating  WHICH  RELEVANT  and  WHICH 
IRRELEVANT test questions are based on the methods for
generating  IF  RELEVANT  test  questions. 

For a WHY RELEVANT test question an example E1 of a
relevancy rule R1 is generated. This example provides a
correct task description T, a source S relevant to T, and a
correct explanation EX1 of why the source S is relevant to
T. Then the agent chooses another rule that is not from
the family of the relevancy rules corresponding to T. This
rule could be from another family of relevancy rules, or
could be a rule corresponding to another evaluation skill,
for instance credibility or accuracy. Let us suppose that
the agent chooses a credibility rule R2. It then generates
an example E2 of R2, based on E1 (that is, E2 and E1 share
as many parts as possible, including the source S). The
agent also generates an explanation EX2 of why S is
credible. While this explanation is correct, it has nothing
to do with why S is relevant to T. Then, the agent
repeats this process to find another explanation that is true
but explains something else, not why S is relevant to T.

It should be noted that, when the agent has to choose an
element from a set, the choice is made at random. Thus,
its behavior differs from one execution to another.

6  EXPERIMENTAL  RESULTS 

The  ontology  of  the  test  generation  agent  includes  the 
description  of  252  historical  concepts,  80  historical 
sources,  and  6  publications.  The  knowledge  base  also 
contains  54  relevancy  rules  grouped  in  four  families,  each 
family  corresponding  to  one  type  of  reporter  task.  These 
rules  have  been  learned  from  an  average  of  2.17 
explanations  (standard  deviation  0.91)  and  5.4  examples 
(standard  deviation  1.37). 

There  are  40,930  instances  of  the  54  relevancy  rules  in  the 
knowledge  base.  Each  such  instance  corresponds  to  an  IF 
RELEVANT test question where the source is relevant. In
principle, for each such test question the agent can
generate  several  IF  RELEVANT  test  questions  where  the 
source  is  not  relevant,  as  well  as  several  WHY  RELEVANT, 
WHICH  RELEVANT  and  WHICH  IRRELEVANT  test 
questions.  Therefore,  the  agent  can  generate  more  than  10 
different  test  questions. 

We  have  performed  four  types  of  experiments  with  the  test 
generation  agent.  The  first  experiment  tested  the 
correctness  of  the  knowledge  base,  as  judged  by  the 
domain  expert  who  developed  the  agent.  This  was 
intended  to  clarify  how  well  the  developed  agent 
represents  the  expertise  of  the  teaching  expert.  The  second 
experiment  tested  the  correctness  of  the  knowledge  base, 
as  judged  by  a  domain  expert  who  was  not  involved  in  its 
development. This was intended to test the generality of
the  agent,  given  that  assessing  relevance  is,  to  a  certain 
extent,  a  subjective  judgment.  The  third  and  the  fourth 
experiments  tested  the  quality  of  the  test  generation  agent, 
as  judged  by  students  and  by  teachers. 

The results of the first two experiments are summarized in
Table 1. To test the predictive accuracy of the knowledge
base, 406 IF RELEVANT test questions were randomly
generated by the agent and answered by the developing
expert. We have performed a similar experiment with a
domain expert who was not involved in the development
of the agent. This independent expert answered
another 401 randomly generated IF RELEVANT test
questions. These experiments have revealed a much
higher predictive accuracy in the case of IF RELEVANT test
questions where the source was relevant. This was
96.53% in the case of the developing expert and 95.45%
in the case of the independent expert. The predictive
accuracy in the case of irrelevant sources was only 81.86%
in the case of the developing expert and 76.35% in the
case of the independent expert. To confirm these results
we have conducted an additional experiment with the
independent expert, who was shown another 1,326 IF
RELEVANT test questions where all the sources were
relevant (for a total of 1,524 such questions). In this case
the predictive accuracy of the agent was 96.19%.

Table 1: Evaluation results

Figure 12: Student survey results

We have analyzed in detail each case where both the
developing expert and the independent expert agreed that
the agent failed to recognize that a source was relevant or
irrelevant to a certain task. In most cases it was concluded
that the representation of the source was incomplete. This
analysis suggested that the representation of the sources
should be guided by the following principle which, if
followed, would have avoided many of the agent’s errors:
Any  historical  source  must  be  completely  described  in 
terms  of  the  concepts  from  the  knowledge  base.  This 
means  that  if  the  knowledge  base  contains  a  certain 
historical  concept,  then  any  historical  source  referring  to 
that  concept  should  contain  the  concept  in  the  description 
of  its  content.  Operationally,  this  simply  means  that  if  the 
expert  decides  to  describe  a  new  source  in  terms  of  some 
new  concept  C,  then  the  expert  has  to  review  again  the 
descriptions of each source S from the knowledge base. If
the expert decides that S refers to C, then she or he has
to  include  C  in  the  representation  of  S.  This  does  not 
mean,  however,  that  the  contents  of  the  historical  sources 
have  to  be  completely  described  (a  task  that  would  be 
very  hard,  especially  for  pictures). 

There were several cases where the two experts disagreed
with each other, mainly because the independent expert had a
broader  interpretation  of  some  general  terms  (such  as  slave 
culture,  activities  related  to  slavery,  cruelty  of  slavery,  and 
master  slave  relationships)  than  the  developer  of  the 
knowledge  base.  However,  the  independent  expert  agreed 
that  someone  else  could  have  a  more  restricted 
interpretation  of  those  terms,  and,  in  such  a  case,  the 
answers  of  the  agent  could  be  considered  correct.  There 
were  also  5  cases  where  the  independent  expert  disagreed 
with the agent and then, upon further analysis of the test
questions,  agreed  with  the  agent. 

Table 1 also indicates the evaluation time because, unlike
automatic learning systems, interactive learning
systems require significant time from domain experts, and
this factor should be taken into consideration when
developing such systems. First of all, one can notice
that it took the independent expert twice as long to
analyze 401 test questions as it took the developing
expert. This is because the independent expert was not
familiar  with  any  of  the  80  historical  sources  used  in  the 
questions,  and  he  had  to  analyze  each  of  them  in  detail  in 
order  to  answer  the  questions.  However,  once  the 
independent  expert  became  familiar  with  the  sources,  he 
answered  the  new  1,326  test  questions  much  faster. 

We  have  also  conducted  an  experiment  with  a  class  of  21 
students  from  the  8th  grade  at  The  Bridges  Academy  in 
Washington  D.C.  The  students  were  first  given  a  lecture 
on  relevance  and  then  were  asked  to  answer  25  test 
questions  that  were  dynamically  generated  by  the  agent. 
Students  were  also  asked  to  investigate  the  hints  and  the 
explanations.  To  record  their  impressions,  they  were 
asked to respond to a set of 18 survey questions with one
of  the  following  phrases:  very  strongly  agree,  strongly 
agree,  agree,  indifferent,  disagree,  strongly  disagree,  and 
very  strongly  disagree.  Figure  12  presents  the  results 
from  7  of  the  most  informative  survey  questions. 

Finally,  a  user  group  experiment  was  conducted  with  8 
teachers  at  The  Public  School  330  in  the  Bronx,  New 
York  City.  This  group  of  teachers  had  the  opportunity  to 
review  the  performance  of  the  agent  and  was  then  asked  to 
complete  a  questionnaire.  Several  of  the  most  informative 
questions and a summary of the teachers’ responses are
presented  in  Figure  13. 

Figure 13: Teacher survey results

7 CONCLUSIONS

In this paper we have presented an innovative application
of the Disciple Shell to the building of a test generation
agent. We have provided experimental evidence that the
process of teaching the agent is natural and efficient, and
that it results in a knowledge base of good quality and in
a useful educational agent. Since the agent is taught by
the educator through examples and explanations, and
is then able to provide similar examples and explanations to
the students, it could be considered a preliminary
example of a new type of educational agent that can be
taught by an educator to teach the students (Hamburger
and Tecuci, 1998). From the point of view of artificial
intelligence research, this work shows an integration of
machine learning and knowledge acquisition with
problem solving and intelligent-tutoring systems. From
the point of view of education research, it shows an
automated computer-based approach to the assessment of
higher-order thinking skills, as well as an assessment that
involves multimedia documents. Future work involves
further development of the agent and its experimental use
in the classroom. We are also continuing the development
of the Disciple approach and are applying it to other
challenging problems, such as building a statistical
analysis assessment and support agent, and an agent that has
to find the best way of working around various damages
to an infrastructure, such as a damaged bridge or tunnel.

Abstract

Many  systems  that  learn  from  examples  express 
the  learned  concept  as  a  disjunction.  Those 
disjuncts  that  cover  only  a  few  examples  are 
referred  to  as  small  disjuncts.  The  problem  with 
small  disjuncts  is  that  they  have  a  much  higher 
error rate than large disjuncts but are necessary to
achieve a high level of predictive accuracy. This
paper investigates the effect of noise on small
disjuncts. In particular, we show that when noise
is added to two real-world domains, a significant
and disproportionate number of the total errors
are  contributed  by  the  small  disjuncts;  thus,  in 
the  presence  of  noise,  it  is  the  small  disjuncts  that 
are  primarily  responsible  for  the  poor  predictive 
accuracy  of  the  learned  concept. 

1  INTRODUCTION 

Systems  that  learn  from  examples  often  express  the 
learned  concept  as  a  disjunction.  The  coverage,  or  size, 
of  each  disjunct  is  defined  as  the  number  of  training 
examples that it correctly classifies (Holte, Acker &
Porter,  1989).  Small  disjuncts  are  those  disjuncts  that 
cover  only  a  few  training  examples.  Although  small 
disjuncts  may  individually  cover  only  a  small  fraction  of 
the  training  examples,  collectively  they  can  cover  a 
significant  percentage  of  the  training  examples.  The 
problem  with  small  disjuncts  is  that  they  have  a  higher 
error  rate  than  large  disjuncts  but  cannot  be  eliminated 
without  greatly  reducing  the  predictive  accuracy  of  the 
learned  concept. 

Early  work  on  small  disjuncts  investigated  a  variety  of 
issues,  including  ways  of  improving  predictive  accuracy 
by eliminating some small disjuncts (Holte et al., 1989;
Quinlan, 1991). Danyluk and Provost (1993) highlighted
the role of small disjuncts in learning from noisy data
when they speculated that in the telecommunication
domain they were studying, learning from noisy data was
hard due to a difficulty distinguishing between systematic
noise and "true" exceptional cases in the training data.
True exceptions and small disjuncts, although similar
entities  which  are  sometimes  used  interchangeably,  differ 
in  one  important  way — true  exceptions  are  defined 
relative  to  the  "true"  (i.e.,  correct)  concept  whereas  small 
disjuncts  are  defined  relative  to  a  learned  concept.  Weiss 
(1995) investigated the interaction of noise with true
exceptions  by  using  artificial  datasets  and  demonstrated 
that  this  interaction  results  in  error  prone  small  disjuncts 
in  the  learned  concept.  In  this  paper  we  focus  on  small 
disjuncts  rather  than  "true  exceptions"  because  for  the 
real  world  domains  we  use,  the  "correct"  concept 
definition  is  not  known,  and  hence  it  is  not  possible  to 
measure  the  true  exceptions. 

This  paper  extends  previous  work  by  examining  the 
effect  of  noise  on  small  disjuncts  using  real-world 
datasets and assessing the impact of this effect on the
overall  learning  process.  In  particular,  we  show  that 
when  noise  is  added  to  these  datasets,  then  the  concept 
learned  from  this  data  exhibits  the  problem  with  noise 
and  small  disjuncts:  that  is,  the  small  disjuncts  contribute 
a  disproportionate,  and  significant,  number  of  the  total 
errors  (relative  to  the  number  of  examples  they  cover)  but 
still  cannot  be  eliminated  without  adversely  affecting  the 
accuracy  of  the  learned  concept.  Thus,  we  show  that  the 
small  disjuncts  are  primarily  responsible  for  learning 
being difficult in the presence of noise.

2  DESCRIPTION  OF  EXPERIMENTS 

This  section  describes  the  learning  program,  problem 
domains  and  experimental  methodology  we  used  to 
conduct  our  experiments. 

2.1 THE LEARNER

All of the experiments described in this paper use C4.5, a
program for inducing decision trees from preclassified
training examples (Quinlan, 1993). C4.5 was chosen
because it is a popular tool for learning disjunctive
concepts and because we were able to modify it, without
too much difficulty, to collect statistics relating to
disjunct size. For the majority of experiments, C4.5 was
run in one of the following two configurations:

—  with  its  default  parameters  and  pruning  strategy,  and 

—  with  its  default  parameters  but  without  any  pruning 
and with the -m1 option to disable the default
stopping  criterion. 

The  -m  option  stops  a  node  from  being  split  during  the 
tree-building  process  if  the  resulting  node  covers  fewer 
than  the  specified  number  of  examples  (1  in  this  case). 
Thus,  in  the  second  configuration,  C4.5  will  build  a 
decision  tree  that  correctly  classifies  all  training  examples 
if  the  examples  are  consistent. 

2.2  THE  PROBLEM  DOMAINS 

This  paper  uses  the  KPa7KR  chess  endgame  (Shapiro, 
1987)  and  Wisconsin  breast  cancer  (Wolberg,  1990) 
datasets,  which  were  obtained  from  the  UCI  repository  of 
machine learning databases (Merz & Murphy, 1998).
These  datasets  were  selected  because  C4.5  was  able  to 
attain  high  levels  of  predictive  accuracy  on  them;  we 
wanted  to  come  as  close  to  learning  the  correct  target 
concept  as  possible  prior  to  the  introduction  of  artificial 
noise.  The  KPa7KR  dataset  contains  3196  examples  with 
36  attributes,  where  each  example  represents  a  board 
position  and  has  the  class  value  "won"  or  "nowin".  The 
Wisconsin  breast  cancer  dataset  contains  699  examples 
with  nine  attributes,  with  each  example  having  the  value 
"benign"  or  "malignant".  The  class  distribution  is 
approximately  equal  for  the  chess  endgame  domain  and  is 
2:1  in  favor  of  the  benign  class  for  the  breast  cancer 
domain.  The  results  for  the  breast  cancer  domain  closely 
parallel  those  for  the  chess  domain  and  therefore  in  most 
cases  we  only  display  the  results  for  the  chess  domain  (all 
results  are  for  the  chess  domain  unless  noted  otherwise). 

2.3  EXPERIMENTAL  METHODOLOGY 

For  each  experiment  seven  independent  runs  were 
performed  and  the  results  averaged  together.  For  each 
run,  200  examples  were  randomly  selected  and  placed 
into  the  training  set  while  the  remaining  examples  were 
placed  into  the  test  set.  Unless  stated  otherwise,  all 
measurements are based on performance on the test
set.  Varying  levels  of  randomly  generated  class  noise  are 
used  in  the  experiments.  The  examples  are  considered 
initially  noise-free.  A  noise  level  of  n%  means  that  with 
probability  n/100  the  class  value  is  randomly  selected 
from  the  remaining  alternatives.  This  means  that  when 
50%  class  noise  is  applied  to  a  domain  with  two  classes, 
there is no information provided by the class variable.
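
The noise model just described can be illustrated with a short Python sketch. It is not the code used in the experiments; the function name `add_class_noise` and its arguments are assumptions introduced here. With probability n/100 a label is replaced by one drawn uniformly from the remaining classes, so with two classes and n = 50 each label is flipped with probability 0.5 and carries no information.

```python
import random


def add_class_noise(labels, classes, noise_pct, rng=random):
    """Return a copy of labels with noise_pct% class noise applied.

    With probability noise_pct/100 each label is replaced by a class chosen
    at random from the remaining alternatives (never the original class).
    """
    noisy = []
    for y in labels:
        if rng.random() < noise_pct / 100.0:
            alternatives = [c for c in classes if c != y]
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(y)
    return noisy


# Toy usage with the chess endgame class values.
clean = ["won", "nowin", "won", "won", "nowin"]
print(add_class_noise(clean, ["won", "nowin"], 10, random.Random(0)))
```

Note that under this definition a noise level of n% changes n% of the labels in expectation, since a replaced label always differs from the original.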

For the experiments performed in this paper, coverage is
defined in terms of the number of test examples correctly
classified, since we felt that this would yield a more fair
measure of the true coverage of each disjunct (just as
measuring accuracy on the test set yields a more fair
measure). However, we do not believe this decision to be
critical. For each graph presented in this paper, coverage
is displayed on a logarithmic scale, so the behavior of the
small disjuncts can be easily identified.


3 THE PROBLEM WITH SMALL DISJUNCTS


Although  the  focus  of  this  paper  is  on  the  problem  with 
noise  and  small  disjuncts,  this  section  will  first  show  that 
the  chess  endgame  and  breast  cancer  domains  exhibit  the 
problem  with  small  disjuncts.  Figures  1  and  2  show  the 
results  of  running  C4.5  on  the  chess  endgame  and 
Wisconsin  breast  cancer  domains,  respectively,  without 
any  artificial  noise  applied  to  the  datasets.  For  these 
figures,  and  for  all  figures  in  this  paper  with  coverage  on 
the  x-axis,  the  value  of  each  curve  at  coverage  n  is  based 
on  the  collective  performance  of  all  the  disjuncts  with 
coverage  less  than  or  equal  to  n.  Thus,  the  curves  labeled 
"Examples"  and  "Errors"  in  Figures  1  and  2  show  the 
percentage  of  total  examples  and  errors,  respectively, 
covered by these disjuncts (i.e., with size ≤ n) when the
learned concept is applied to the test set. The error rate
curve shows the error rate of the disjuncts with size ≤ n.


Figure 1: The Effect of Disjunct Size (Chess Domain)

Figure 2: The Effect of Disjunct Size (Cancer Domain)


An  example  will  help  clarify  the  meanings  of  these 
curves  and  demonstrate  that  small  disjuncts  are  "error 
prone".  In  Figure  1,  the  curves  for  errors  and  error  rate 
intersect  at  coverage  40.  The  curves  tell  us  that  the 
disjuncts with size ≤ 40 collectively have an error rate of
50%  and  collectively  cover  50%  of  the  total  errors,  but 
only  cover  5%  of  the  total  examples.  This  clearly 
demonstrates  that  small  disjuncts  are  error  prone  (i.e., 
they  cover  a  disproportionate  number  of  errors).  The 
error  rate  for  the  learner  as  a  whole  can  be  found  by 
looking at the error rate when 100% of the errors and
examples  have  been  covered;  we  see  from  this  that  the 
overall  error  rate  for  the  chess  endgame  domain  is  5% 
and  the  overall  error  rate  for  the  breast  cancer  domain  is 
6%.  The  error  rate  curve  also  shows  that  small  disjuncts 
have  a  higher  error  rate  than  large  disjuncts,  since  the 
error  rate  decreases  (for  both  domains)  as  larger  disjuncts 
are  included  in  the  error  rate  calculations. 

Figures  1  and  2  show  that  most  examples  are  covered  by 
the  larger  disjuncts,  but  the  smaller  disjuncts  nonetheless 
cover a large percentage of the examples. This is more
evident for the breast cancer domain, but even for the
chess endgame domain disjuncts of size ≤ 100 are much
more  error  prone  than  the  larger  disjuncts  and  cover  about 
20%  of  the  total  examples.  These  results  are  consistent 
with  those  described  by  Holte  and  colleagues  (1989).  In 
addition,  since  the  small  disjuncts  cover  too  many 
examples  to  be  simply  dropped  from  the  learned  concept 
without  significantly  impacting  the  accuracy  of  the 
concept,  these  results  also  demonstrate  that  these  domains 
exhibit  the  problem  with  small  disjuncts. 


4 THE PROBLEM WITH NOISE AND SMALL DISJUNCTS


This  section  will  show  that  for  the  chess  and  breast 
cancer  domains,  noise  results  in  small  disjuncts  being 
mainly  responsible  for  the  errors  in  the  learned  concept. 
For  these  experiments,  no  pruning  is  done  unless 
specified  and  class  noise  is  applied  to  both  the  training 
and  test  sets. 

Figure  3  shows  what  happens  to  the  error  rate  as  the  noise 
rate is varied (recall that for coverage of n, the
"collective" error rate is based on all disjuncts with size
≤ n). The figure shows that the addition of 5% class noise
causes  the  error  rate  for  small  disjuncts  to  increase,  but 
from that point on it decreases as more noise is added.

Figure 3: Effect of Noise on Error Rate


To  make  it  easier  to  see  the  degree  to  which  errors  are 
concentrated  toward  the  small  disjuncts,  we  will  use  a 
statistic  called  the  error  factor,  first  introduced  by  Weiss 
(1995).  The  error  factor  is  defined  as: 


\[
\text{Error Factor}(cov) \equiv \frac{\%\ \text{cumulative errors}(cov)}{\%\ \text{cumulative examples}(cov)}
\]


The  error  factor  is  a  function  of  coverage  and  is 
essentially  the  "Errors"  curve  divided  by  the  "Examples" 
curve.  For  example,  the  error  factor  at  coverage  40  in 
Figure  1  is  10  (50%/5%),  which  indicates  that  disjuncts 
with size ≤ 40 contribute 10 times more errors than
expected  if  coverage  had  no  effect  on  error  rate. 
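
The cumulative curves and the error factor can be computed as in the following Python sketch. It is an illustration only, not the modified C4.5 code: the function name `disjunct_curves` and the representation of a disjunct as a (correct, errors) pair counted on the test set are assumptions made here. The toy data are chosen to reproduce the ratios quoted above (5% of the examples, 50% of the errors, error factor 10), not the actual chess results.

```python
def disjunct_curves(disjuncts):
    """Cumulative statistics over disjuncts ordered by coverage.

    disjuncts: list of (correct, errors) pairs, one per disjunct, counted on
        the test set. Coverage is the number of correctly classified examples.
    Returns one dict per distinct coverage value n, aggregating all disjuncts
    with coverage <= n.
    """
    total_examples = sum(c + e for c, e in disjuncts)
    total_errors = sum(e for _, e in disjuncts)
    if total_examples == 0 or total_errors == 0:
        raise ValueError("need at least one example and one error")

    curves = []
    for n in sorted({c for c, _ in disjuncts}):
        covered = sum(c + e for c, e in disjuncts if c <= n)
        errors = sum(e for c, e in disjuncts if c <= n)
        if covered == 0:
            continue  # disjuncts that match no test example add nothing
        pct_examples = 100.0 * covered / total_examples
        pct_errors = 100.0 * errors / total_errors
        curves.append({
            "coverage": n,
            "pct_examples": pct_examples,             # "Examples" curve
            "pct_errors": pct_errors,                 # "Errors" curve
            "error_rate": 100.0 * errors / covered,   # "Error Rate" curve
            "error_factor": pct_errors / pct_examples,
        })
    return curves


# Toy data: one small error-prone disjunct and one large accurate disjunct.
toy = [(25, 25), (925, 25)]
for row in disjunct_curves(toy):
    print(row)
```

At the largest coverage value every disjunct is included, so the error factor is 1 and the error rate equals the overall error rate of the learned concept (5% for the toy data).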


Figure  4,  which  plots  the  error  factor  versus  coverage, 
shows  the  effect  of  noise  on  small  disjuncts  even  more 
clearly  than  Figure  3,  since  error  factor  is  a  relative 
measure  which  takes  into  account  the  different  overall 
error  rates  resulting  from  learning  with  the  different 
levels  of  class  noise.  Figure  4  shows  that  as  the  amount 
of  noise  increases  the  error  factor  for  small  disjuncts 
decreases. This indicates that, as the noise level increases,
the percentage of errors contributed by the small
disjuncts  decreases  and/or  the  percentage  of  examples 
covered by the small disjuncts increases.

Figure 4: Effect of Noise on Error Factor

Noise added to the training data will undoubtedly affect
the  concept  that  is  learned  and  will  therefore  affect  the 
small  disjuncts  in  the  learned  concept.  Figure  5  addresses 
this  by  showing  how  various  noise  levels  affect  the 
number of examples covered by the small disjuncts.

Figure 5: Effect of Noise on Distribution of Cases


Figure  5  shows  that  as  more  noise  is  added  to  the  data, 
the  number  of  examples  covered  by  small  disjuncts 
increases  dramatically.  For  example,  disjuncts  of 
size ≤ 100 cover 3 times as many examples when the
noise level increases from no noise to 10% noise. Figure
5  confirms  what  we  and  others  had  suspected — that  noisy 
data  will  cause  a  learner  to  form  "erroneous"  small 
disjuncts.

Figure 6 shows how the distribution of errors changes as
noise is applied to the domain. It shows that when the
noise level is less than 20%, small disjuncts with size ≤
30 account for an even greater percentage of the total
errors than when there was no noise. Thus, we now have
an explanation of why the error factor in Figure 4
decreased as additional noise was introduced: it was
because the number of examples covered by the small
disjuncts increased at a faster rate than the number of
errors contributed by these disjuncts. Note that once the
noise level reaches 30%, disjuncts with coverage ≤
30 no longer cover a disproportionate number of the
errors — they  cover  half  of  the  errors  but  also  cover  almost 
half  of  the  total  examples.  The  breast  cancer  domain 
exhibits  similar  trends. 


in  such  cases  a  very  aggressive  overfitting  avoidance 
strategy is needed to adequately learn the correct concept.

Figure 7: Effect of Pruning on Overall Error Rate


5  UNDERSTANDING  THE  EFFECT  OF 
NOISE  ON  SMALL  DISJUNCTS 

In  the  experiments  described  in  the  previous  section,  the 
training  and  tests  sets  were  generated  from  the  same 
distribution.  While  this  is  the  most  realistic  scenario, 
when  one  is  trying  to  understand  the  effect  of  noise  on 
learning,  noise  is  frequently  only  applied  to  either  the 
training  or  test  set. 


Figure  6:  Effect  of  Noise  on  Distribution  of  Errors  g_  j  jjjj;  EFFECT  ON  TRAINING 


We  can  summarize  the  results  from  Figures  3-6  as 
follows:  in  the  presence  of  noise,  small  disjuncts  have  a 
higher  error  rate  than  large  disjuncts  and  cover  a 
significant  number  of  the  total  cases  and  total  errors.  As 
a  consequence,  small  disjuncts  contribute  a 
disproportionate  and  very  significant  number  of  the 
errors.  All  of  this  holds  true  until  very  high  levels  of 
noise  are  applied,  at  which  point  the  impact  of  noise  on 
the  large  disjuncts  becomes  important  relative  to  the 
impact  of  noise  on  small  disjuncts — at  which  point  small 
disjuncts  can  no  longer  be  blamed  for  the  poor 
performance  of  the  learned  concept. 

Since  overfitting  avoidance  strategies  such  as  pruning  are 
more  likely  to  eliminate  small  disjuncts  than  large 
disjuncts,  it  is  interesting  to  see  how  these  strategies  will 
affect  the  error  rate  and  how  this  can  be  related  to  the  role 
of  small  disjuncts.  Figure  7  shows  how  pruning  affects 
the  overall  error  rate.  Since  it  is  not  possible  to  predict 
random  class  noise,  the  optimal  error  rate  will  equal  Ae 
noise  rate.  This  figure  shows  that  the  default  pruning 
strategy  improves  the  error  rate  in  the  presence  of  class 
noise  and  improves  it  the  most  when  the  noise  rate  is 
between  10%  and  20%.  This  is  explained  by  the  fact  that 
in  this  range  the  small  disjuncts  have  very  high  error  rates 
(Figure  3)  and  contribute  a  very  large  percentage  of  the 
total  errors  (Figure  6).  The  strategy  which  uses  C4.5’s 
-m20  option  to  prevent  nodes  from  being  formed  when 
fewer  than  20  examples  are  covered  also  improves  the 
error  rate,  except  when  there  is  no  noise.  This  strategy 
also  outperforms  the  default  pruning  strategy  when  there 
are  very  high  levels  of  noise  (e.g.,  30%),  indicating  that 


Noise  applied  only  to  the  training  set  tests  the  ability  to 
learn  the  "correct"  concept  in  the  presence  of  noise 
(Quinlan,  1986).  That  is,  by  limiting  the  noise  to  the 
training  set,  we  can  evaluate  the  sensitivity  of  the  learner 
to  noise.  We  can  accomplish  this  evaluation,  even 
without  knowing  the  "correct"  concept,  by  using  the 
noise-free  test  data  to  approximate  the  correct  concept. 
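
A minimal sketch of this setup is given below. It assumes that class noise at a given rate means each label is, with that probability, replaced by a different class chosen at random; this section does not spell out the noise model, so that choice, the function name, and the toy labels are all assumptions of the sketch.

```python
import random
from typing import List, Sequence


def add_class_noise(labels: Sequence[int], rate: float,
                    classes: Sequence[int],
                    rng: random.Random) -> List[int]:
    """With probability `rate`, replace each label by a different
    class picked uniformly at random (assumed noise model)."""
    noisy = []
    for y in labels:
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


rng = random.Random(0)               # fixed seed for repeatability
train_labels = [0, 1] * 500          # hypothetical two-class labels
test_labels = [0, 1] * 100

# Noise is restricted to the training set; the test set stays clean
# and stands in for the "correct" concept when measuring error.
noisy_train = add_class_noise(train_labels, 0.10, [0, 1], rng)
clean_test = list(test_labels)
flipped = sum(a != b for a, b in zip(train_labels, noisy_train))
```

A learner would then be trained on `noisy_train` and scored against `clean_test`. Applying the same function to the test labels instead gives the test-only condition.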

As  shown  earlier,  noise  in  the  training  set  introduces 
additional  "erroneous"  small  disjuncts  into  the  learned 
concept.  Experiments  identical  to  those  described  earlier 
were  repeated  with  the  artificial  noise  restricted  to  the 
training  set.  Graphs  corresponding  to  those  shown  in 
Figures  3-6  were  generated.  The  results  indicated  that 
under  these  circumstances  small  disjuncts  have  an  even 
more  significant  impact  on  learning  and,  in  particular, 
contribute  a  greater  percentage  of  the  errors  than  when 
noise  was  applied  to  both  the  training  and  test  sets. 

5.2  THE  EFFECT  ON  TESTING 

It  is  also  meaningful  to  study  the  effect  of  noise  on  the 
test  set.  This  situation  corresponds  to  the  scenario  in 
which  the  training  data  is  "cleaned  up",  perhaps  by  using 
more  costly  measurement  equipment,  in  the  hope  of
achieving  improved  predictive  accuracy.  Experiments  in 
which  the  noise  was  limited  to  the  test  set  were  run  and 
the  results  showed  that,  relative  to  the  case  where  noise
was  applied  to  both  the  training  and  test  sets,  the  small
disjuncts  had  much  less  of  a  negative  impact  on  learning.

Footnote:  However,  if  systematic  noise  is  applied  to  the  test  set,  better  predictive
accuracy  may  be  obtained  by  leaving  the  noise  in  the  training  set.

5.3  DISCUSSION 

The  results  described  in  the  previous  two  subsections  can 
be  explained  by  examining  how  noise  affects  small 
disjuncts.  First  of  all,  noise  in  the  training  set  will 
influence  the  concept  that  is  learned  but  noise  in  the  test 
set  cannot.  Since  small  disjuncts  are  based  on  the  learned
concept,  we  can  conclude  that  noise  in  the  test  set  cannot
cause  small  disjuncts  to  be  formed.  Furthermore,  noise  in
the  test  set  will  tend  to  affect  all  disjuncts  equally  (Weiss, 
1995).  This  explains  why  the  effect  of  noise  on  small 
disjuncts  is  less  dramatic  when  noise  is  applied  to  both 
the  training  and  test  sets  than  when  it  is  limited  to  the 
training  set — in  the  former  case  noise  in  the  test  set 
reduces  the  relative  difference  in  error  rates  between  the 
small  and  large  disjuncts.  When  noise  is  applied  to  only 
the  test  set,  the  effect  is  greatly  diminished,  and  would 
disappear  completely  if  the  learner  were  able  to  learn  the 
correct  concept  prior  to  the  introduction  of  artificial  noise. 
For  a  more  in-depth  description  of  how  noise  affects
small  disjuncts,  refer  to  Weiss  (1995).
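
The equalizing effect of test-set noise can be illustrated with a small calculation. The sketch assumes a two-class problem in which each test label is flipped independently with probability p, regardless of which disjunct covers the example; the error rates used are hypothetical and are not results from the experiments.

```python
def observed_error(true_error: float, noise: float) -> float:
    """Error rate measured against noisy binary test labels.

    A prediction counts as wrong if it was wrong and the label was
    not flipped, or if it was right and the label was flipped.
    """
    return true_error * (1 - noise) + (1 - true_error) * noise


# Hypothetical error rates on noise-free test data.
small_err, large_err = 0.30, 0.05

# Ratio of small-disjunct to large-disjunct error at rising noise.
ratios = [observed_error(small_err, p) / observed_error(large_err, p)
          for p in (0.0, 0.1, 0.2)]
```

Under these assumptions the ratio shrinks toward 1 as p grows, which is consistent with the observation that noise in the test set reduces the relative difference in error rates between small and large disjuncts.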

6  CONCLUSION 

This  paper  investigated  the  effect  of  noise  on  small 
disjuncts  and  how  this  effect  impacts  the  overall  learning 
process.  For  both  the  KPa7KR  chess  end-game  domain 
and  the  Wisconsin  breast  cancer  domain,  the 
experimental  results  in  this  paper  show  that  small 
disjuncts  are  responsible  for  learning  being  difficult. 
Only  at  very  high  levels  of  class  noise  do  the  large 
disjuncts  contribute  a  relatively  large  percentage  of  the 
total  errors.  This  paper  also  showed  some  trends  and 
effects  that  we  feel  are  likely  to  hold  for  learning  in 
general  and  not  just  for  the  two  domains  used  in  this 
paper.  In  particular,  we  feel  that  1)  noise  tends  to 
decrease  the  number  of  large  disjuncts  and  increase  the 
number  of  small  disjuncts  in  the  learned  concept,  2) 
relatively  low  levels  of  noise  will  increase  the  percentage 
of  errors  contributed  by  small  disjuncts,  but  this  effect 
will  diminish  as  higher  levels  of  noise  are  applied,  and  3) 
noise  in  the  test  set  has  an  equalizing  effect  which 
decreases  the  impact  of  the  small  disjuncts  on  learning. 

We  believe  these  results  are  important  because  they 
provide  some  insight  into  how  noise  affects  learning  and 
how  the  effect  of  noise  manifests  itself  in  the  learned 
concept.  Given  the  prevalence  of  noise  in  real-world 
problem  domains,  such  an  understanding  is  critical.  This 
work  also  provides  additional  justification  for  overfitting 
avoidance  strategies  and  hopefully  provides  some 
additional  insights  into  why  these  strategies  work,  how 
they  can  be  improved  and  the  limitations  of  such 
strategies. 